Abstract

Accumulating evidence indicates that the capacity to integrate information in the brain is a
prerequisite for consciousness. Integrated Information Theory (IIT) of consciousness provides a
mathematical approach to quantifying the information integrated in a system, called integrated
information, $\Phi$. Integrated information is defined theoretically as the amount of information a
system generates as a whole, above and beyond the sum of the amount of information its parts
independently generate. IIT predicts that the amount of integrated information in the brain should
reflect levels of consciousness. Empirical evaluation of this theory requires computing integrated
information from neural data acquired from experiments, but difficulties with using the original
measure $\Phi$ preclude such computations. Although some practical measures have been previously
proposed, we found that these measures fail to satisfy the theoretical requirements as a measure of
integrated information. Measures of integrated information should satisfy the following lower and
upper bounds. The lower bound of integrated information should be 0 when the system does not
generate information (no information) or when the system comprises independent parts (no
integration). The upper bound of integrated information is the amount of information generated by
the whole system and is realized when the amount of information generated independently by its
parts equals 0. Here we derive a novel practical measure, $\Phi^*$, by introducing the concept of
mismatched decoding developed in information theory. We show that $\Phi^*$ is properly bounded from
below and above, as required of a measure of integrated information. We derive the analytical
expression of $\Phi^*$ under the Gaussian assumption, which makes it readily applicable to
experimental data. Our novel measure $\Phi^*$ can be generally used as a measure of integrated
information in research on consciousness, and also as a tool for network analysis in research on
diverse areas of biology.


Author Summary 

Integrated Information Theory (IIT) of consciousness attracts scientists who investigate
consciousness owing to its explanatory and predictive powers for understanding the neural
properties of consciousness. IIT predicts that the levels of consciousness are related to the
quantity of information integrated in the brain, which is called integrated information $\Phi$.
Integrated information measures excess information generated by a system as a whole above and
beyond the amount of information independently generated by its parts. Although IIT predictions are
indirectly supported by numerous experiments, validation is required through quantifying integrated
information directly from experimental neural data. Practical difficulties account for the absence
of direct, quantitative support. To resolve these difficulties, several practical measures of
integrated information have been proposed. However, we found that these measures do not satisfy the
theoretical requirements of integrated information: first, integrated information should not be
below 0; and second, integrated information should not exceed the quantity of information generated
by the whole system.

Here, we propose a novel practical measure of integrated information, designated $\Phi^*$, that
satisfies these theoretical requirements by introducing the concept of mismatched decoding
developed in information theory. $\Phi^*$ creates the possibility of empirical and quantitative
validations of IIT to gain novel insights into the neural basis of consciousness.


Introduction 

Although its neurobiological basis remains unclear, consciousness may be related to certain aspects
of information processing. In particular, Integrated Information Theory of consciousness (IIT),
developed by Tononi and colleagues, predicts that the amount of information integrated among the
components of a system, called integrated information $\Phi$, is related to the level of
consciousness of the system. The level of consciousness in the brain varies from a very high level,
as in full wakefulness, to a very low level, as in deeply anesthetized states or dreamless sleep.
When consciousness changes from high to low, IIT predicts that the amount of integrated information
changes from high to low accordingly. This prediction is indirectly supported by recent
neuroimaging experiments that combine noninvasive magnetic stimulation of the brain (transcranial
magnetic stimulation, TMS) with electrophysiological recordings of stimulation-evoked activity
(electroencephalography) [10-14]. Such evidence implies that if there is a practical method to
estimate the amount of integrated information from neural activities, we may be able to measure
levels of consciousness using integrated information.

IIT provides several versions of mathematical formulations to calculate integrated information.
Although the detailed mathematical formulations are different, the central philosophy of integrated
information does not vary among different versions of IIT. Integrated information is mathematically
defined as the amount of information generated by a system as a whole above and beyond the amount
of information generated independently by its parts. If the parts are independent, integrated
information will not exist.

Despite its potential importance, the empirical calculation of integrated information is difficult.
One difficulty is an assumption made when integrated information is calculated according to the
informational relationship between past and present states of a system. The distribution of past
states is assumed to maximize entropy, which is called the maximum entropy distribution. The
assumption of the maximum entropy distribution severely limits the applicability of the original
integrated information measure $\Phi$, as indicated by [15]. First, the concept of the maximum
entropy distribution cannot be applied to a system that comprises elements whose states are
continuous, because there is no unique maximum entropy distribution for continuous variables
[15, 16]. Second, information under the assumption of the maximum entropy distribution can be
computed only when there is complete knowledge about the transition probability matrix that
describes how the system transits between states. However, the transition probability matrix for
actual neuronal systems is practically impossible to estimate for all possible states.

To overcome these problems, Barrett and Seth [15] proposed using the empirical distribution
estimated from experimental data, thereby removing the requirement to rely on the assumption of the
maximum entropy distribution. Although we believe that their approach does lead to practical
computation of integrated information, we found that their proposed measures based on the empirical
distribution [15] do not satisfy key theoretical requirements as a measure of integrated
information. Two theoretical requirements should be satisfied by a measure of integrated
information. First, the amount of integrated information should not be negative. Second, the amount
of integrated information should never exceed the information generated by the whole system. These
theoretical requirements, which are satisfied by the original measure $\Phi$, are required so that
a measure of integrated information is interpretable in accordance with the original philosophy of
integrated information, i.e., integrated information measures the extra information generated by a
system as a whole above and beyond the amount of information independently generated by its parts.

Here, we propose a novel practical measure of integrated information, $\Phi^*$, by introducing the
concept of mismatched decoding developed in information theory [17-19]. $\Phi^*$ represents the
difference between "actual" and "hypothetical" mutual information between past and present states
of the system. The actual mutual information corresponds to the amount of information that can be
extracted about past states by knowing present states (or vice versa) when the actual probability
distribution of a system is used for decoding information for past and present states. In contrast,
hypothetical mutual information corresponds to the amount of information that can be extracted
about past states by knowing present states when a "mismatched" probability distribution is used
for decoding, where a system is partitioned into hypothetical independent parts. Decoding with a
mismatched probability distribution is called mismatched decoding. $\Phi^*$ quantifies the amount
of information lost through mismatched decoding, where interactions between the parts are ignored.
We show here that $\Phi^*$ satisfies the theoretical requirements as a measure of integrated
information, unlike the previously proposed measures. Further, we derive the analytical expression
of $\Phi^*$ under the Gaussian assumption and make this measure feasible for practical computation.


Results 

While its central ideas are unchanged, IIT updated measures of integrated information. The original
formulation, IIT 1.0 [2], underwent major developments leading to IIT 2.0 [6] and the latest
version, IIT 3.0 [8]. In the present study, we focus on the version in IIT 2.0 [6], because the
measure of integrated information proposed in IIT 2.0 is simpler and more feasible to calculate
compared with that in IIT 3.0 [8].

Here, we briefly review the original measure of integrated information, $\Phi$, in IIT 2.0 [6] and
describe its limitations for practical application [15]. From the concept of the original measure,
we point out the lower and upper bounds that a measure of integrated information should satisfy. We
next introduce two practical measures of integrated information, $\Phi_I$ and $\Phi_H$, proposed by
[15], and show that $\Phi_I$ and $\Phi_H$ fail to satisfy the lower and upper bounds of integrated
information. Finally, we derive a novel measure of integrated information, $\Phi^*$, from the
decoding perspective, which is properly bounded from below and above.

Measure of integrated information with the maximum entropy distribution 

Integrated information is a quantity that measures how much extra information is generated by the
system as a whole above and beyond the information independently generated by its parts. Consider
partitioning a system into $m$ parts, $M_1, M_2, \ldots, M_m$, and computing the quantity of
information that is integrated across the $m$ parts of the system. As detailed in Methods, the
measure of integrated information proposed in IIT 2.0 can be expressed as follows:

$$\Phi = I({}^{\max}X^{t-\tau}; X^t) - \sum_{i=1}^{m} I({}^{\max}M_i^{t-\tau}; M_i^t), \tag{1}$$

where $X^{t-\tau}$ and $X^t$ are the states of a system in the past $t-\tau$ ($\tau > 0$) and the
present $t$, respectively. The distribution of past states is assumed to be the maximum entropy
distribution, and the superscript max is placed to the left of $X^{t-\tau}$ to indicate explicitly
that the distribution of past states is the maximum entropy distribution. The first term of Eq. 1,
$I({}^{\max}X^{t-\tau}; X^t)$, represents the mutual information between the past and present
states in the whole system, and the second term represents the sum of the mutual information
between the past and present states in the $i$-th part of the system,
$I({}^{\max}M_i^{t-\tau}; M_i^t)$. Thus, $\Phi$, the difference between them, gives the information
generated by the whole system above and beyond the information generated independently by its
parts. If the parts are independent, no extra information is generated, and the integrated
information is 0. We can rewrite Eq. 1 in terms of entropy $H$ as follows:

$$\Phi = \sum_{i=1}^{m} H({}^{\max}M_i^{t-\tau} \mid M_i^t) - H({}^{\max}X^{t-\tau} \mid X^t). \tag{2}$$

To derive the above expression, we use the fact that the entropy of the whole system,
$H({}^{\max}X^{t-\tau})$, equals the sum of the entropy of the subsystems,
$\sum_i H({}^{\max}M_i^{t-\tau})$, when the maximum entropy distribution is assumed.
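
As a concrete illustration of Eq. 1, the following minimal sketch evaluates $\Phi$ for a toy system of two binary units, each treated as one part, with a uniform (maximum entropy) distribution imposed on past states. The toy transition rule (`build_transition`, with a hypothetical flip probability `eps`) and all function names are assumptions made for illustration; they are not taken from IIT 2.0 or from Methods.

```python
import numpy as np

def mutual_information(joint):
    """Mutual information (in nats) of a 2-D joint probability table."""
    px = joint.sum(axis=1, keepdims=True)
    py = joint.sum(axis=0, keepdims=True)
    mask = joint > 0
    return float(np.sum(joint[mask] * np.log(joint[mask] / (px @ py)[mask])))

def build_transition(source, eps):
    """Hypothetical 2-unit system: unit i copies unit source[i] of the
    previous time step and flips the copied bit with probability eps.
    Returns trans[x_past, x_now] = p(x_now | x_past) over 4 joint states."""
    trans = np.zeros((4, 4))
    for xp in range(4):
        past = (xp >> 1, xp & 1)
        for xn in range(4):
            now = (xn >> 1, xn & 1)
            p = 1.0
            for i in range(2):
                p *= (1 - eps) if now[i] == past[source[i]] else eps
            trans[xp, xn] = p
    return trans

def phi_max_entropy(trans, n_units=2):
    """Eq. 1 with a uniform distribution over past states.
    Returns (phi, mutual information of the whole system)."""
    joint = trans / trans.shape[0]            # uniform past times transition
    whole = mutual_information(joint)
    bits = joint.reshape([2] * (2 * n_units))  # axes: past bits, then present bits
    parts = 0.0
    for i in range(n_units):
        keep = (i, n_units + i)
        drop = tuple(ax for ax in range(2 * n_units) if ax not in keep)
        parts += mutual_information(bits.sum(axis=drop))
    return whole - parts, whole

phi_swap, mi_swap = phi_max_entropy(build_transition(source=(1, 0), eps=0.1))
phi_self, mi_self = phi_max_entropy(build_transition(source=(0, 1), eps=0.1))
print(phi_swap, mi_swap)  # units copy each other
print(phi_self, mi_self)  # units copy only themselves
```

In this toy setting, when each unit copies the other, no part carries information about its own past, so $\Phi$ equals the mutual information of the whole system; when each unit copies only itself, the parts are independent and $\Phi$ is 0 up to rounding error.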

Theoretical requirements as a measure of integrated information 

To interpret a measure of integrated information as the “extra” information generated by a system as 
a whole above and beyond its parts, it should satisfy the theoretical requirements, as follows: first, 
integrated information should not be negative because information independently generated by the parts 
should never exceed information generated by the whole. Integrated information should equal 0 when 
the amount of information generated by the whole system equals 0 (no information) or when the amount 
of information generated by the whole is equal to that generated by its parts (no integration). Second, 
integrated information should not exceed the amount of information generated by the whole system 
because the information generated by the parts should be larger than or equal to 0. In short, integrated 
information should be lower-bounded by 0 and upper-bounded by the information generated by the whole 
system. 

One can check that the original measure $\Phi$ satisfies the lower and upper bounds:

$$0 \le \Phi \le I({}^{\max}X^{t-\tau}; X^t). \tag{3}$$

As shown in Methods, $\Phi$ can be written as a Kullback-Leibler divergence (see Eq. 30). Thus,
$\Phi$ is positive or equal to 0. Further, as can be seen from Eq. 1, the upper bound of $\Phi$ is
the mutual information in the entire system, because the sum of the mutual information in the parts
is larger than or equal to 0.

Practical measures of integrated information with empirical distribution 

As we described in the previous section, in the original measure $\Phi$, the distribution of past
states is assumed to be the maximum entropy distribution, which limits the practical application of
$\Phi$. First, the maximum entropy distribution can be applied only when the states of a system are
discrete. If the states are represented by discrete variables, the maximum entropy distribution is
the uniform distribution over all possible states of $X^{t-\tau}$. When the states of a system are
described by continuous variables, the maximum entropy distribution cannot be uniquely defined
[15, 16]. Second, the transition probability matrix of a system, $p(X^t \mid X^{t-\tau})$, must be
known for all possible past states $X^{t-\tau}$, because the sum of $\log p(X^t \mid X^{t-\tau})$
over all possible past states must be computed to obtain the mutual information
$I({}^{\max}X^{t-\tau}; X^t)$. However, it is nearly impossible to estimate experimentally such a
complete transition probability matrix in an actual neural system, because some states may not
occur during a reasonable period of observation. Although it may be possible to force the system
into a particular state by stimulating some neurons while silencing others and then to estimate
transition probabilities for each state, this is technically extremely demanding.

A simple remedy for the limitations of the original measure $\Phi$ is to not impose the maximum
entropy distribution on past states but to instead use the probability distributions obtained from
empirical observations of the system. Barrett and Seth [15] adopted this strategy to derive two
practical measures of integrated information from Eqs. 1 and 2 by substituting the maximum entropy
distribution with the empirical distribution as follows:

$$\Phi_I = I(X^{t-\tau}; X^t) - \sum_{i=1}^{m} I(M_i^{t-\tau}; M_i^t), \tag{4}$$

$$\Phi_H = \sum_{i=1}^{m} H(M_i^{t-\tau} \mid M_i^t) - H(X^{t-\tau} \mid X^t). \tag{5}$$
[Figure 1 panel labels: Matched decoding, $I(X^{t-\tau}; X^t)$, decoding with
$p(X^t \mid X^{t-\tau})$; Mismatched decoding, $I^*(X^{t-\tau}; X^t)$; Integrated information
$\Phi^* = I(X^{t-\tau}; X^t) - I^*(X^{t-\tau}; X^t)$]

Figure 1. Integrated information with an empirical distribution based on the concept of mismatched
decoding. The figure shows a system with five neurons in which the arrows represent directed
connectivity and the colors represent the states of the neurons (black: silence, white: firing,
gray: unknown). The past states $X^{t-\tau}$ are decoded given the present states $X^t$. The "true"
conditional distribution $p(X^t \mid X^{t-\tau})$ is used for matched decoding, while a "false"
conditional distribution $q(X^t \mid X^{t-\tau})$ is used for mismatched decoding, where the parts
of a system, $M_1$ and $M_2$, are assumed independent. The amount of information about past states
that can be extracted from present states using matched and mismatched decoding is quantified by
the mutual information $I(X^{t-\tau}; X^t)$ and the "hypothetical" mutual information
$I^*(X^{t-\tau}; X^t)$ for mismatched decoding, respectively. In this framework, integrated
information, $\Phi^*(X^{t-\tau}; X^t)$, is defined as the difference between $I(X^{t-\tau}; X^t)$
and $I^*(X^{t-\tau}; X^t)$.


Note that $\Phi_I$ and $\Phi_H$ are not equal when the empirical distribution is used for past
states, because the entropy of the whole system, $H(X^{t-\tau})$, is not equal to the sum of the
entropy of the subsystems, $\sum_i H(M_i^{t-\tau})$. $\Phi_H$ was also derived from a perspective
different from IIT, i.e., the perspective of information geometry, as a measure of spatio-temporal
interdependencies and is termed "stochastic interaction" [20].

Although these two measures appear as natural modifications of the original measure, they do not
satisfy the theoretical requirements as a measure of integrated information. We discuss the
problems of $\Phi_I$ and $\Phi_H$ in detail below.
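
To make the difference between Eqs. 4 and 5 concrete, the following sketch computes both measures for two binary units from a joint probability table over past and present states. The random table stands in for an empirical distribution, and the function names and array layout are assumptions chosen for illustration. Expanding the mutual information in Eq. 4 into entropies shows that $\Phi_H - \Phi_I = \sum_i H(M_i^{t-\tau}) - H(X^{t-\tau})$, which the last lines check numerically.

```python
import numpy as np

def entropy(p):
    """Shannon entropy (in nats) of a probability table of any shape."""
    p = np.asarray(p, dtype=float).ravel()
    p = p[p > 0]
    return float(-np.sum(p * np.log(p)))

def phi_I_and_phi_H(joint):
    """joint[b0_past, b1_past, b0_now, b1_now] for two binary units,
    each unit being one part. Returns (phi_I, phi_H) as in Eqs. 4 and 5."""
    h_past = entropy(joint.sum(axis=(2, 3)))
    h_now = entropy(joint.sum(axis=(0, 1)))
    h_joint = entropy(joint)
    mi_whole = h_past + h_now - h_joint
    cond_whole = h_joint - h_now            # H(X^{t-tau} | X^t)
    mi_parts, cond_parts = 0.0, 0.0
    for i in range(2):
        drop = tuple(ax for ax in range(4) if ax not in (i, 2 + i))
        part = joint.sum(axis=drop)         # part[m_past, m_now]
        h_p = entropy(part.sum(axis=1))
        h_n = entropy(part.sum(axis=0))
        h_j = entropy(part)
        mi_parts += h_p + h_n - h_j
        cond_parts += h_j - h_n             # H(M_i^{t-tau} | M_i^t)
    return mi_whole - mi_parts, cond_parts - cond_whole

rng = np.random.default_rng(0)
joint = rng.random((2, 2, 2, 2))
joint /= joint.sum()                        # stand-in for an empirical distribution
phi_I, phi_H = phi_I_and_phi_H(joint)

past = joint.sum(axis=(2, 3))               # past[b0_past, b1_past]
gap = entropy(past.sum(axis=1)) + entropy(past.sum(axis=0)) - entropy(past)
print(phi_I, phi_H, phi_H - phi_I, gap)
```

The gap vanishes only when the past states of the parts are independent, which is what the maximum entropy distribution enforces and an empirical distribution generally does not.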

Integrated information measure based on mismatched decoding 

Here, we propose an alternative practical measure of integrated information that satisfies the
theoretical requirements, which we call $\Phi^*$ (phi star) (Fig. 1). $\Phi^*$, which uses the
empirical distribution, can be applied to actual neuronal recordings. Similar to $\Phi_I$, we
derive $\Phi^*$ from the mutual-information form of the original measure $\Phi$ in Eq. 1. Given the
problem of $\Phi_I$ in Eq. 4, we should refine the second term of Eq. 4 while the first term, the
mutual information in the whole system, is unchanged. The second term should be a quantity that can
be interpreted as information generated independently by the parts of a system and should not
exceed the information generated by the system as a whole.

To derive a proper second term in Eq. 4, we interpret the mutual information from a decoding
perspective and introduce the concept of "mismatched decoding", which was developed in information
theory [17] (see Methods for details). Consider that the past states $X^{t-\tau}$ are decoded given
the present states $X^t$. From the decoding perspective, the mutual information can be interpreted
as the maximum information about the past states that can be obtained knowing the present states.
To extract the maximum information, the decoding must be performed optimally using the "true"
conditional distribution,

$$p(X^t \mid X^{t-\tau}) = p(M_1^t, \ldots, M_m^t \mid M_1^{t-\tau}, \ldots, M_m^{t-\tau}). \tag{6}$$

Note that the expression on the right accounts explicitly for interactions among all the parts. The
optimal decoding can be performed using maximum likelihood estimation. In the above setting,
maximum likelihood estimation means choosing the past state that maximizes
$p(X^t \mid X^{t-\tau})$ given a present state. Decoding that uses the true distribution,
$p(X^t \mid X^{t-\tau})$, is called "matched decoding" because the probability distribution used
for decoding matches the actual probability distribution.

Decoding that uses a "false" conditional distribution, $q(X^t \mid X^{t-\tau})$, is called
"mismatched" decoding. To quantify integrated information, we consider specifically the mismatched
decoding that uses the "partitioned" probability distribution $q(X^t \mid X^{t-\tau})$,

$$q(X^t \mid X^{t-\tau}) = \prod_{i=1}^{m} p(M_i^t \mid M_i^{t-\tau}), \tag{7}$$

where a system is partitioned into parts and the parts $M_i$ are assumed to be independent.
$q(X^t \mid X^{t-\tau})$ is the product of the conditional probability distributions in each part,
$p(M_i^t \mid M_i^{t-\tau})$. The distribution $q(X^t \mid X^{t-\tau})$ is "mismatched" with the
actual probability distribution, because parts are generally not independent. We evaluate the
amount of information obtained from mismatched decoding. As with matched decoding, mismatched
decoding is performed using maximum likelihood estimation, wherein the past state that maximizes
$q(X^t \mid X^{t-\tau})$ is selected. The amount of information obtained from mismatched decoding
is necessarily degraded compared with that obtained from matched decoding. The best decoding
performance can be achieved only using matched decoding with the actual probability distribution
$p(X^t \mid X^{t-\tau})$.

We consider the amount of information that can be obtained from mismatched decoding,
$I^*(X^{t-\tau}; X^t)$, as a proper second term of Eq. 4 (see Methods for the mathematical
expression of $I^*$). The difference between $I(X^{t-\tau}; X^t)$ and $I^*(X^{t-\tau}; X^t)$
provides a new practical measure of integrated information (Fig. 1),

$$\Phi^*(X^{t-\tau}; X^t) = I(X^{t-\tau}; X^t) - I^*(X^{t-\tau}; X^t). \tag{8}$$

$\Phi^*$ quantifies the information loss caused by mismatched decoding, where a system is
partitioned into independent parts and the interactions between the parts are ignored. $\Phi^*$
satisfies the theoretical requirements as a measure of integrated information, because $I^*$ is
greater than or equal to 0 and is less than or equal to the information in the whole system, $I$.
$\Phi^*$ defined this way is equivalent to the original measure $\Phi$ if the maximum entropy
distribution is imposed on past states instead of an empirical distribution (see Supporting
Information for the proof). Thus, we can consider $\Phi^*$ a natural extension of the original
measure $\Phi$ to the case when the empirical distribution is used.
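
The definition in Eq. 8 can be sketched numerically for a small discrete system. The expression for $I^*$ is given in Methods and is not reproduced in this section, so the code below assumes the form commonly used in the mismatched-decoding literature: $I^* = \max_\beta \tilde{I}(\beta)$ with $\tilde{I}(\beta) = \sum p(x^{t-\tau}, x^t) \log q(x^t \mid x^{t-\tau})^\beta - \sum p(x^t) \log \sum p(x^{t-\tau}) q(x^t \mid x^{t-\tau})^\beta$. That formula, the grid search over $\beta$, the requirement of a strictly positive joint table, and all names are assumptions of this sketch rather than the authors' implementation.

```python
import numpy as np

def phi_star_discrete(joint, betas=np.linspace(0.0, 3.0, 301)):
    """joint[b0_past, b1_past, b0_now, b1_now] > 0 for two binary units,
    each unit being one part. Returns (phi_star, I, I_star) in nats."""
    p = joint.reshape(4, 4)                   # rows: past state, cols: present state
    p_past, p_now = p.sum(axis=1), p.sum(axis=0)
    mi = float(np.sum(p * np.log(p / np.outer(p_past, p_now))))

    # Partitioned decoder of Eq. 7: q(x_now | x_past) = prod_i p(m_i_now | m_i_past)
    q = np.ones((2, 2, 2, 2))
    for i in range(2):
        drop = tuple(ax for ax in range(4) if ax not in (i, 2 + i))
        part = joint.sum(axis=drop)           # part[m_past, m_now]
        cond = part / part.sum(axis=1, keepdims=True)
        shape = [1, 1, 1, 1]
        shape[i], shape[2 + i] = 2, 2
        q = q * cond.reshape(shape)
    q = q.reshape(4, 4)
    log_q = np.log(q)

    best = 0.0                                # beta = 0 always gives 0
    for beta in betas:
        decoder_term = beta * np.sum(p * log_q)
        norm_term = np.sum(p_now * np.log(p_past @ q**beta))
        best = max(best, float(decoder_term - norm_term))
    return mi - best, mi, best

# A generic interacting system (random stand-in for an empirical distribution).
rng = np.random.default_rng(1)
joint_random = rng.random((2, 2, 2, 2)) + 0.05
joint_random /= joint_random.sum()
phi_r, mi_r, istar_r = phi_star_discrete(joint_random)

# A system whose transition factorizes over the two units (no interaction).
past = np.array([[0.4, 0.1], [0.2, 0.3]])
T0 = np.array([[0.9, 0.1], [0.2, 0.8]])
T1 = np.array([[0.7, 0.3], [0.4, 0.6]])
joint_indep = np.einsum('ab,ac,bd->abcd', past, T0, T1)
phi_i, mi_i, istar_i = phi_star_discrete(joint_indep)
print(phi_r, mi_r)
print(phi_i, mi_i)
```

Because the grid only approximates the maximum over $\beta$, the returned value can slightly overestimate $\Phi^*$, but it stays between 0 and the mutual information of the whole system. When the transition factorizes over the parts, the partitioned decoder coincides with the true one and the sketch returns a value of 0 up to rounding error.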

Analytical computation of $\Phi^*$ using Gaussian approximation

Although using an empirical distribution instead of the maximum entropy distribution makes
integrated information more feasible to calculate, it is still difficult to compute $\Phi^*$ in a
large system, because the summation (or integral) over all possible states must be calculated. The
number of possible states grows exponentially with the size of the system, and therefore the
computational cost of computing $\Phi^*$ also grows exponentially. Thus, for practical calculation
of $\Phi^*$, we need to approximate $\Phi^*$ in some way, such as approximating the probability
distribution of neural states using the Gaussian distribution [15]. $\Phi^*$ can be analytically
computed using the Gaussian approximation (see Methods). The Gaussian approximation significantly
reduces the computational cost and makes $\Phi^*$ practically computable even in a large system.
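
The saving is easiest to see for the first term of Eq. 8: under a Gaussian assumption, the mutual information of the whole system depends only on covariance matrices, so no sum over states is needed. The sketch below shows this standard closed form together with a helper that estimates the required covariances from data; the function names and the `(n_samples, n_units)` data layout are assumptions. The analytical expression for $I^*$, and hence for $\Phi^*$, is given in Methods and is not reproduced here.

```python
import numpy as np

def gaussian_mutual_information(cov_past, cov_now, cross_cov):
    """I(X_past; X_now) in nats for jointly Gaussian variables.
    cross_cov[i, j] = Cov(X_past[i], X_now[j])."""
    # Covariance of the present state conditioned on the past state.
    cond_cov = cov_now - cross_cov.T @ np.linalg.solve(cov_past, cross_cov)
    _, logdet_now = np.linalg.slogdet(cov_now)
    _, logdet_cond = np.linalg.slogdet(cond_cov)
    return 0.5 * (logdet_now - logdet_cond)

def lagged_covariances(data, tau=1):
    """Empirical covariances from data of shape (n_samples, n_units).
    Returns (cov_past, cov_now, cross_cov) for time lag tau."""
    past, now = data[:-tau], data[tau:]
    n = data.shape[1]
    full = np.cov(np.hstack([past, now]), rowvar=False)
    return full[:n, :n], full[n:, n:], full[:n, n:]

# Check against a scalar autoregressive process x_t = a * x_{t-1} + noise,
# whose stationary variance is 1 / (1 - a^2) for unit-variance noise.
a = 0.5
var = 1.0 / (1.0 - a**2)
mi = gaussian_mutual_information(
    np.array([[var]]), np.array([[var]]), np.array([[a * var]])
)
print(mi, -0.5 * np.log(1.0 - a**2))
```

With data from an experiment, `lagged_covariances(data, tau)` would supply the three matrices, at a cost that grows polynomially rather than exponentially with the number of units.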

Theoretical requirements are not satisfied by previously proposed measures 

As described above, the lower and upper bounds of integrated information should equal 0 and the
information generated by the whole system, respectively. In this section, by considering two
extreme cases, we demonstrate that the previously proposed measures $\Phi_I$ and $\Phi_H$ [15] do
not satisfy either the lower or the upper bound.

When there is no information 

First, we consider cases where there is no information between past and present states of a system,
i.e., $I(X^{t-\tau}; X^t) = 0$. In this case, integrated information should be 0. As expected,
$\Phi^*$ and $\Phi_I$ are 0, because the amount of information for mismatched decoding,
$I^*(X^{t-\tau}; X^t)$, and the mutual information in each part, $I(M_i^{t-\tau}; M_i^t)$, are both
0 when $I(X^{t-\tau}; X^t) = 0$.

$$\Phi^* = 0, \tag{9}$$

$$\Phi_I = 0. \tag{10}$$

However, $\Phi_H$ is not 0. $\Phi_H$ can be written as

$$\Phi_H = \sum_i H(M_i^{t-\tau}) - H(X^{t-\tau}). \tag{11}$$

$\Phi_H$ is not 0 when the information is 0 because $\Phi_H$ is based not on the mutual information
but on the conditional entropy (see Eq. 5). Therefore, $\Phi_H$ does not necessarily reflect the
amount of information in a system.

As a simple example that shows the above problem of $\Phi_H$, consider the following linear
regression model,

$$X^t = A \cdot X^{t-1} + E^t. \tag{12}$$

Here, $X$ is the state of the units, $A$ is a connectivity matrix, and $E^t$ is multivariate
Gaussian noise with zero mean and covariance $\Sigma(E)$. $E^t$ is uncorrelated over time. For
simplicity, consider a system composed of two units (the following argument can be easily
generalized to a system with more than two units). We set the connectivity matrix, $A$, and the
covariance matrix of the noise, $\Sigma(E)$, as follows:


$$A = a \begin{pmatrix} 1 & 1 \\ 1 & 1 \end{pmatrix}, \tag{13}$$

$$\Sigma(E) = \begin{pmatrix} 1 & c \\ c & 1 \end{pmatrix}, \tag{14}$$


where $a$ and $c$ are parameters that control the strengths of connections and noise correlation,
respectively. We compute measures of integrated information using the above model. The time
difference $\tau$ is set to 1. We assume that the prior distribution of the system is the steady
state distribution, where the covariance of past states, $\Sigma(X^{t-1})$, and that of present
states, $\Sigma(X^t)$, are equal, i.e., $\Sigma(X^{t-1}) = \Sigma(X^t) = \Sigma(X)$. The covariance
of the steady state distribution, $\Sigma(X)$, can be calculated by taking the covariance of both
sides of Eq. 12:


$$\Sigma(X) = A \Sigma(X) A^T + \Sigma(E). \tag{15}$$
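
Eq. 15 is a linear equation in the entries of $\Sigma(X)$ and can be solved directly. The following sketch does so by vectorization, using the identity $\mathrm{vec}(A \Sigma A^T) = (A \otimes A)\,\mathrm{vec}(\Sigma)$. The function name is hypothetical, and the example values assume a connectivity matrix with all entries equal to $a$ and a noise covariance with unit variances and off-diagonal $c$. A steady state exists only when all eigenvalues of $A$ have magnitude below 1, which for this $A$ means $|a| < 0.5$.

```python
import numpy as np

def steady_state_cov(A, cov_noise):
    """Solve Sigma = A @ Sigma @ A.T + cov_noise for Sigma (Eq. 15)."""
    n = A.shape[0]
    # (I - A kron A) vec(Sigma) = vec(cov_noise)
    lhs = np.eye(n * n) - np.kron(A, A)
    vec = np.linalg.solve(lhs, cov_noise.reshape(-1))
    return vec.reshape(n, n)

a, c = 0.4, 0.4                      # illustrative parameter values
A = a * np.ones((2, 2))              # assumed form of the connectivity matrix
cov_noise = np.array([[1.0, c], [c, 1.0]])
cov_x = steady_state_cov(A, cov_noise)
print(cov_x)
print(np.allclose(cov_x, A @ cov_x @ A.T + cov_noise))
```

For larger systems, a dedicated discrete Lyapunov solver would scale better than forming the Kronecker product explicitly; the direct solve is used here only to keep the sketch short.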


[Figure 2: time series of units 1 and 2; panel label: $I_1 = 0$]

Figure 2. Exemplar time series when the strength of noise correlation $c$ and the connection
strength $a$ are set to 0.9 and 0, respectively, in the linear regression model (Eq. 12). $I_1$ and
$I_2$ represent the mutual information in units 1 and 2. Because there is no connection, there is
no information between past and present states of the system: $I_1$ and $I_2$ are both 0. In this
case, $\Phi^*$ and $\Phi_I$ are 0, as they should be, yet $\Phi_H$ is positive.


We consider a case where the connection strength $a$ is 0. Fig. 2 shows an exemplar time series
when the strength of noise correlation $c$ is 0.9. Because there are no connections, including
self-connections within each unit, each unit has no information between past and present states,
i.e., $I_1 = I_2 = 0$. As can be seen from Fig. 2, however, the two time series correlate at each
moment because of the high noise correlation.

We varied the degree of noise correlation, $c$, from 0 to 1 while keeping the connection strength
$a$ at 0 (Fig. 3A). $\Phi^*$ and $\Phi_I$ stay 0 independent of noise correlation. However, the
entropy-based measure, $\Phi_H$, increases monotonically with $c$, irrespective of the amount of
information in the whole system (Fig. 3A). In other words, $\Phi_H$ does not reflect the amount of
information in a system but does reflect the degree of correlation between the parts. As shown in
Eq. 11, $\Phi_H$ is the difference between the sum of the entropy within each part and the entropy
in the whole system. When the parts correlate, the entropy in the whole system decreases. In
contrast, the sum of the entropy of each part does not change, because the degree of noise within
each part (the diagonal elements of $\Sigma(E)$) is fixed. Thus, $\Phi_H$ increases as the degree
of noise correlation $c$ increases.

Because $\Phi_H$ is not 0 even when there is no information in the system, it exceeds the mutual
information in the whole system and does not satisfy the upper bound as a measure of integrated
information.
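
The violation of the upper bound can be reproduced in a few lines under the Gaussian model. The sketch below evaluates $\Phi_H$ (Eq. 5) and the mutual information of the whole system from covariance matrices for $a = 0$, in which case the steady state covariance equals the noise covariance. The function name is hypothetical and the noise covariance is assumed to have unit variances and off-diagonal $c$; with these assumptions Eq. 11 reduces to $\Phi_H = -\frac{1}{2}\log(1 - c^2)$.

```python
import numpy as np

def phi_H_and_mi(A, cov_x):
    """Gaussian Phi_H (Eq. 5) and whole-system mutual information, in nats.
    Each unit is one part; cov_x is the steady state covariance."""
    cross = cov_x @ A.T                                   # Cov(X^{t-1}, X^t)
    # Covariance of the past state conditioned on the present state.
    cond_whole = cov_x - cross @ np.linalg.solve(cov_x, cross.T)
    logdet_cond_whole = np.linalg.slogdet(cond_whole)[1]
    logdet_x = np.linalg.slogdet(cov_x)[1]
    # Conditional variance of each unit's past given its own present.
    cond_parts = np.diag(cov_x) - np.diag(cross) ** 2 / np.diag(cov_x)
    phi_H = 0.5 * (np.sum(np.log(cond_parts)) - logdet_cond_whole)
    mi = 0.5 * (logdet_x - logdet_cond_whole)
    return phi_H, mi

A = np.zeros((2, 2))                                      # a = 0: no connections
results = {}
for c in (0.0, 0.5, 0.9):
    cov_x = np.array([[1.0, c], [c, 1.0]])                # equals the noise covariance when a = 0
    results[c] = phi_H_and_mi(A, cov_x)
    print(c, results[c], -0.5 * np.log(1.0 - c**2))
```

The mutual information stays at 0 for every $c$, while $\Phi_H$ grows with $c$ and therefore exceeds it.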

When parts are perfectly correlated 

Next, we consider a case where the parts are perfectly correlated. More specifically, consider the
case where the two parts $M_1$ and $M_2$ are equal at every time, i.e.,
$M_1^{t-\tau} = M_2^{t-\tau} = M^{t-\tau}$ and $M_1^t = M_2^t = M^t$. Here, $\Phi^*$ is 0 because
the amount of information extracted by mismatched decoding would not degrade even if the other part
is ignored for decoding (see Supporting Information for the mathematical proof).

$$\Phi^* = 0. \tag{16}$$
Figure 3. Theoretical requirements as a measure of integrated information are not satisfied by
$\Phi_H$ and $\Phi_I$. The behaviors of $\Phi^*$, $\Phi_I$, and $\Phi_H$ are shown in the left,
middle, and right panels, respectively, when the strength of noise correlation $c$ is varied in a
linear regression model (Eq. 12). Red lines indicate the regime where the theoretical requirement
is violated, and blue lines indicate that the theoretical requirement is satisfied. Dotted black
lines are drawn at 0. (A) Violation of the upper bound. The strength of connections $a$ is set to
0. In this case, there is no information between past and present states of the system but $\Phi_H$
is not 0, i.e., $\Phi_H$ violates the upper bound. (B) Violation of the lower bound. The strength
of connections $a$ is set to 0.4. At the right ends of the panels, where $c$ is 1, the two units in
the system are perfectly correlated. $\Phi_I$ is negative, i.e., $\Phi_I$ violates the lower bound,
when the degree of correlation is high.


Regarding $\Phi_I$, when the parts are perfectly correlated, the mutual information of each part is
equal, $I(M_1^{t-\tau}; M_1^t) = I(M_2^{t-\tau}; M_2^t) = I(M^{t-\tau}; M^t)$, and the mutual
information in the whole system is equal to the mutual information of each part,
$I(X^{t-\tau}; X^t) = I(M^{t-\tau}; M^t)$. Thus, the second term in Eq. 4 is twice the value of the
first, and $\Phi_I$ is the negative of the mutual information in one part,

$$\Phi_I = -I(M^{t-\tau}; M^t). \tag{17}$$

Thus, $\Phi_I$ does not satisfy the lower bound as a measure of integrated information. From Eq. 5,
$\Phi_H$ is given by

$$\Phi_H = 2H(M^{t-\tau} \mid M^t) - H(X^{t-\tau} \mid X^t), \tag{18}$$

which is larger than or equal to 0 ($\Phi_H$ is always larger than or equal to 0 because it can be
written as a Kullback-Leibler divergence).

To illustrate the behaviors of these three measures of integrated information when the degree of
correlation varies, we considered the same linear regression model presented in the previous
section (Eq. 12). We varied the degree of noise correlation, $c$, from 0 to 1 while keeping the
connection strength $a$ at 0.4.
[Figure 4 panel labels: $I_1 = 0.33$; $I = 0.51$]

Figure 4. Exemplar time series when the strength of noise correlation $c$ and the connection
strength $a$ are both set to 0.4 in the linear regression model (Eq. 12). $I_1$ and $I_2$ represent
the mutual information in units 1 and 2, and $I$ represents the mutual information in the whole
system. In this case, the sum of the mutual information in the parts exceeds the mutual information
in the whole system and $\Phi_I$ is negative.


When $c$ is 1, the two units correlate perfectly. Fig. 4 shows an exemplar time series when $c$ is
0.4 and $a$ is 0.4. $\Phi_I$ takes positive values when $c$ is less than ~0.2 but takes negative
values when $c$ is greater (Fig. 3B). $\Phi^*$ decreases monotonically with $c$ and becomes 0 when
$c$ is 1. $\Phi_H$ increases monotonically with $c$, reflecting the degree of correlation between
the units. The detailed behaviors of $\Phi^*$, $\Phi_I$, and $\Phi_H$ when $a$ and $c$ are both
varied are shown in Supporting Information.
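
The sign change of $\Phi_I$ can be checked with a short Gaussian computation. The sketch below solves the steady state equation (Eq. 15), evaluates the mutual information of the whole system and of each unit, and forms $\Phi_I$ as in Eq. 4. It assumes a connectivity matrix with all entries equal to $a$ and a noise covariance with unit variances and off-diagonal $c$; with $a = c = 0.4$ these assumptions give values that round to the 0.33 and 0.51 reported for Figure 4. All function names are hypothetical.

```python
import numpy as np

def steady_state_cov(A, cov_noise):
    """Solve Sigma = A @ Sigma @ A.T + cov_noise (Eq. 15) by vectorization."""
    n = A.shape[0]
    lhs = np.eye(n * n) - np.kron(A, A)
    return np.linalg.solve(lhs, cov_noise.reshape(-1)).reshape(n, n)

def gaussian_mi(cov_past, cov_now, cross_cov):
    """Mutual information (nats) between jointly Gaussian past and present."""
    cond = cov_now - cross_cov.T @ np.linalg.solve(cov_past, cross_cov)
    return 0.5 * (np.linalg.slogdet(cov_now)[1] - np.linalg.slogdet(cond)[1])

def phi_I_linear_gaussian(a, c):
    """Phi_I (Eq. 4) for the two-unit model with lag 1; each unit is one part.
    Returns (phi_I, I of the whole system, [I_1, I_2])."""
    A = a * np.ones((2, 2))                       # assumed connectivity matrix
    cov_noise = np.array([[1.0, c], [c, 1.0]])    # assumed noise covariance
    cov_x = steady_state_cov(A, cov_noise)
    cross = cov_x @ A.T                           # Cov(X^{t-1}, X^t)
    mi_whole = gaussian_mi(cov_x, cov_x, cross)
    mi_parts = [
        gaussian_mi(cov_x[i:i + 1, i:i + 1], cov_x[i:i + 1, i:i + 1],
                    cross[i:i + 1, i:i + 1])
        for i in range(2)
    ]
    return mi_whole - sum(mi_parts), mi_whole, mi_parts

phi_I_04, mi_04, parts_04 = phi_I_linear_gaussian(a=0.4, c=0.4)
phi_I_00, mi_00, parts_00 = phi_I_linear_gaussian(a=0.4, c=0.0)
print(phi_I_04, mi_04, parts_04)   # correlated noise
print(phi_I_00, mi_00, parts_00)   # uncorrelated noise
```

In this sketch $\Phi_I$ is positive for uncorrelated noise and negative at $c = 0.4$, because the information carried by each unit separately grows with $c$ while the information in the whole system does not.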


Discussion 

In this study, we consider two theoretical requirements that a measure of integrated information
should satisfy: the lower and upper bounds of integrated information should be 0 and the amount of
information generated by the whole system, respectively. The theoretical requirements are naturally
derived from the original philosophy of integrated information, which states that integrated
information is the information generated by a system as a whole above and beyond its parts. The
original measure of integrated information $\Phi$ satisfies these requirements, which are needed so
that we can interpret a measure of integrated information according to the original philosophy. To
derive a practical measure of integrated information that satisfies the required lower and upper
bounds, we introduced the concept of mismatched decoding. We defined our measure of integrated
information $\Phi^*$ as the amount of information lost when a mismatched probability distribution,
where a system is partitioned into "independent" parts, is used for decoding instead of the actual
probability distribution. In this framework, $\Phi^*$ quantifies the amount of information loss
associated with mismatched decoding where all interactions between the parts of a system are
ignored and therefore quantifies the amount of information integrated by such interactions between
the parts. We show that $\Phi^*$ satisfies the lower and upper bounds, that $\Phi_I$ does not
satisfy the lower bound, and that $\Phi_H$ does not satisfy the upper bound. We consider $\Phi^*$ a
proper measure of integrated information that can be generally used for practical applications.

The basic concept of Integrated Information Theory (IIT) has been tested empirically, and the accumulated
evidence supports the conclusion that when consciousness is lost, integration of information is lost
[10–14]. In particular, Casali and colleagues [14] found that a complexity measure motivated by IIT
successfully separates conscious awake states from various unconscious states due to deep sleep,
anesthesia, and traumatic brain injuries. Although their measure is inspired by the concept of integrated
information, it measures the complexity of averaged neural responses to one particular type of external
perturbation (e.g., a TMS pulse to a target region) and does not directly measure integrated information.

Few studies have directly estimated integrated information in the brain [21,22] using the
measure introduced in IIT 1.0 [2] or $\Phi_H$. Our new measure of integrated information, $\Phi^*$, will
contribute to experiments designed to test whether integrated information is a key to distinguishing
conscious states from unconscious states [23–25].

We considered the measure of integrated information proposed in IIT 2.0 because its computation is
feasible. The latest version, IIT 3.0 [8], includes several updates. One important update is that both the
cause and the effect of a present state are considered when quantifying integrated information. In
IIT 2.0, integrated information is quantified by measuring how the distribution of past states differs when
a present state is given, i.e., only the cause of a present state is considered (see Methods). IIT 3.0
additionally measures how the distribution of future states differs when a present state is given, i.e., the
effect of a present state is also considered. Our measure $\Phi^*$ does not treat the past cause and the
future effect of a present state asymmetrically, because the mutual information is symmetric with respect
to the times $t-\tau$ and $t$. An unanswered question is how integrated information should be calculated
in practice from an empirical distribution when cause and effect are taken into account separately.

An unresolved difficulty that impedes practical calculation of integrated information is how to partition
a system. In the present study, we considered only the quantification of integrated information when a
partition of a system is given. IIT requires that integrated information be quantified using the
partition where information is least integrated, called the minimum information partition (MIP) [3,6].
To find the MIP, every possible partition must be examined, yet the number of possible partitions grows
exponentially with the size of the system. One way to work around this difficulty would be to develop
optimization algorithms that quickly find a partition that approximates the MIP well.
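The growth in the number of candidate partitions can be made concrete with a short sketch. The example below is only an illustration and assumes, for simplicity, that the search is restricted to bipartitions (splits into two non-empty parts); the function name `bipartitions` is hypothetical and not part of the original study.

```python
from itertools import combinations

def bipartitions(units):
    """Yield each split of `units` into two non-empty parts exactly once."""
    units = list(units)
    first, rest = units[0], units[1:]
    # Fixing `first` in part A avoids counting (A, B) and (B, A) twice.
    for k in range(len(rest)):  # k = number of extra units that join `first`
        for extra in combinations(rest, k):
            part_a = (first,) + extra
            part_b = tuple(u for u in rest if u not in extra)
            yield part_a, part_b

# Number of bipartitions for systems of increasing size: 2**(n - 1) - 1.
counts = {n: sum(1 for _ in bipartitions(range(n))) for n in (2, 4, 8, 16)}
```

Even under this restriction the count roughly doubles with every added unit, which is why an exhaustive search for the MIP quickly becomes impractical.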

Besides the practical problem of finding the MIP, there remains a theoretical problem of how to compare
integrated information across different partitions. Integrated information increases as the number
of parts gets larger, because more information will be lost by partitioning the system. Further,
integrated information is expected to be larger in a symmetric partition, where a system is partitioned into
two parts of equal size, than in an asymmetric partition. IIT 2.0 [5] proposes a normalization factor
that takes these issues into account. However, there might be other possible ways to perform normalization.
It is unclear whether there is a reasonable theoretical foundation that adjudicates the best normalization
scheme. Moreover, it is unclear whether the normalization factor, which was proposed under the assumption
that the states of a system are represented by discrete variables, would be appropriate for cases where
the states are represented by continuous variables. Further investigations are required to resolve the
practical and theoretical issues related to the MIP.

Although we derived $\Phi^*$ because we were motivated by IIT and its potential relevance to consciousness,
$\Phi^*$ has a unique meaning from the perspective of information theory that is independent of IIT.
Thus, it can be applied to research fields other than consciousness research. $\Phi^*$ quantifies the loss of
information when interactions or connections between the units in a system are ignored. Thus, $\Phi^*$ can be
expected to be related to connectivity measures such as Granger causality [26] or transfer entropy [27]. It
will be interesting to clarify the mathematical relationships between $\Phi^*$ and other connectivity measures.
Here, we point out only one apparent difference between them: $\Phi^*$ is intended to measure global
integration in a system as a whole, whereas traditional bivariate measures such as Granger causality or
transfer entropy are intended to measure local interactions between elements of the system. Suppose that
we divide a system into parts A, B, and C. Using integrated information, our goal is to quantify the
information integrated among A, B, and C as a whole. In contrast, what we quantify using Granger
causality or transfer entropy analysis is the influence of A on B, B on C, C on A, and the reverse. It
is not obvious how a measure of global interactions in the whole system should be defined and derived
theoretically from measures of local interaction. One possibility is simply to sum all
local interactions and to consider the sum a global measure [28]. Yet more research is required to
determine whether such an approach is a valid way to define global interactions. $\Phi^*$, in contrast, is not
derived from local interaction measures but is derived directly by comparing the total mutual information
in the whole system with the hypothetical mutual information when the system is assumed to be partitioned
into independent parts. Thus, the interpretation of $\Phi^*$ is straightforward from an information-theoretic
viewpoint. Our measure, which we consider a measure of global interactions, may provide new insights
into diverse research subjects as a novel tool for network analysis.


Methods 

Intrinsic and extrinsic information 

Before introducing the concept of integrated information, we clarify the definition of “information” in IIT.
In IIT, information always refers to intrinsic information, in contrast to extrinsic information [8]. Intrinsic
and extrinsic here refer to the perspective from which information is considered. Intrinsic information
is quantified from the perspective of the system itself, whereas extrinsic information is quantified from the
perspective of an external observer. In this section, we explain the differences in detail.

In neuroscience, many studies focus on quantifying the informational relationship between neural
states and external stimuli or observable output behaviors [29–32]. For example, the mutual information
between neural states $X$ and external stimuli $S$ is quantified as

$$I(X;S) = H(S) - H(S \mid X), \tag{19}$$

where the entropy $H(S)$ and the conditional entropy $H(S \mid X)$ are given by

$$H(S) = -\sum_{s} p(s) \log p(s), \tag{20}$$

$$H(S \mid X) = -\sum_{x,s} p(s,x) \log p(s \mid x). \tag{21}$$

Here, $x$ and $s$ represent a particular neural state and a particular external stimulus, respectively, with
$p(x)$, $p(s)$, $p(s,x)$, and $p(s \mid x)$ denoting the probabilities of $x$ and $s$, the joint probability of
$x$ and $s$, and the conditional probability of $s$ given $x$, respectively. The sums run over all possible
neural states $x$ and all stimuli $s$. The capital letters $S$ and $X$ represent the entire sets of $s$ and $x$,
respectively. When neural states are represented by continuous variables, the sum $\sum$ must be replaced
with the integral $\int$. As shown in Eq. [19], mutual information is expressed as the difference between the
entropy of stimuli, $H(S)$, and the conditional entropy of stimuli given neural states, $H(S \mid X)$. Thus,
$I(X;S)$ quantifies the reduction of uncertainty about stimuli gained from knowledge of neural states from
the perspective of an external observer, i.e., the extent to which an external observer can know about
external stimuli by observing neural states. This type of information is called extrinsic information
because the information is quantified from an external observer’s point of view.
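As a small numerical illustration of Eqs. 19–21, the sketch below computes $I(X;S)$ from a joint probability table. The table values and the function names are assumptions made for this example only; they do not come from the study.

```python
import math

def entropy(p):
    """Shannon entropy (in bits) of a probability vector."""
    return -sum(v * math.log2(v) for v in p if v > 0)

def mutual_information(joint):
    """I(X;S) = H(S) - H(S|X) for a table joint[x][s] = p(x, s)."""
    p_x = [sum(row) for row in joint]
    p_s = [sum(col) for col in zip(*joint)]
    h_s = entropy(p_s)
    # H(S|X) = -sum_{x,s} p(s,x) log p(s|x), with p(s|x) = p(s,x) / p(x)
    h_s_given_x = -sum(
        p_xs * math.log2(p_xs / p_x[i])
        for i, row in enumerate(joint)
        for p_xs in row
        if p_xs > 0
    )
    return h_s - h_s_given_x

# Hypothetical example: a binary stimulus S and a binary neural state X
# that matches the stimulus 90% of the time.
joint = [[0.45, 0.05],
         [0.05, 0.45]]
info = mutual_information(joint)

# If X and S are independent, observing X tells the observer nothing.
independent = [[0.25, 0.25],
               [0.25, 0.25]]
info_independent = mutual_information(independent)
```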

Intrinsic information, in contrast, is quantified from the viewpoint of the system itself, independent of
observations by any external entity [8]. Intrinsic information should not depend on external variables
but only on internal variables of the system. If information concerns consciousness, it is considered
intrinsic information, because consciousness is independent of external observers. With this concept of
intrinsic information, IIT aims to quantify how much “difference” the internal mechanisms of a system
make for the system itself, i.e., the degree of influence a system exerts on itself through its internal
causal mechanisms. How past states affect present states is determined by the transition
probability matrix of the system, $p(X^t \mid X^{t-\tau})$, which specifies the probabilities with which any
state of a system transitions to any other state. Here, $X^t$ and $X^{t-\tau}$ are the states of the system at
times $t$ and $t-\tau$, which we call present and past states, respectively. IIT quantifies intrinsic
information using the transition probability matrix.

The intrinsic information proposed in IIT 2.0 quantifies the extent to which the mechanisms of the system
make the posterior probability distribution of past states given a present state differ from
a prior distribution of past states. The posterior probability distribution of past states given a present
state represents the likelihood of potential causes of the given present state. Intrinsic information in IIT
2.0, which is called “effective information”, is defined as the difference between the posterior probability
distribution, $p(X^{t-\tau} \mid x^t)$, and a prior distribution of past states, $p(X^{t-\tau})$, as follows:

$$\mathrm{ei}(x^t) = D_{KL}\left(p(X^{t-\tau} \mid x^t)\,\|\,p(X^{t-\tau})\right), \tag{22}$$

where $D_{KL}(p(X)\,\|\,q(X))$ is the Kullback-Leibler divergence, which measures the dissimilarity between the
two probability distributions $p$ and $q$ and is given by

$$D_{KL}(p(X)\,\|\,q(X)) = \sum_{x} p(x) \log \frac{p(x)}{q(x)}. \tag{23}$$

If there are no causal mechanisms within the system, present states are not affected by past states. Thus,
the posterior distribution of past states does not differ from the prior distribution. IIT interprets the
degree of the “difference” that the internal mechanisms make to the posterior probability distribution of
past states as information generated intrinsically within the system. Note that although intrinsic
information is based on an intrinsic property of the system, this does not mean that it cannot be quantified
by an external observer.
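The definition in Eq. 22 can be sketched for a small discrete system. The example assumes a uniform (maximum entropy) prior over past states, obtains the posterior with Bayes' rule, and uses two hypothetical two-state transition matrices; all names are illustrative.

```python
import math

def kl_divergence(p, q):
    """D_KL(p || q) in bits; assumes q > 0 wherever p > 0."""
    return sum(pi * math.log2(pi / qi) for pi, qi in zip(p, q) if pi > 0)

def effective_information(tpm, present):
    """ei(x^t) for a matrix tpm[past][present] = p(x^t | x^{t-tau})."""
    n = len(tpm)
    prior = [1.0 / n] * n  # assumed uniform (maximum entropy) prior
    likelihood = [tpm[past][present] for past in range(n)]
    evidence = sum(l * pr for l, pr in zip(likelihood, prior))
    posterior = [l * pr / evidence for l, pr in zip(likelihood, prior)]
    return kl_divergence(posterior, prior)

# Hypothetical two-state systems.
copy_tpm = [[1.0, 0.0], [0.0, 1.0]]    # present state copies the past state
random_tpm = [[0.5, 0.5], [0.5, 0.5]]  # present state ignores the past state

ei_copy = effective_information(copy_tpm, present=0)      # 1 bit
ei_random = effective_information(random_tpm, present=0)  # 0 bits
```

The second system has no causal mechanism, so the posterior equals the prior and no information is generated, as described above.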

To quantify intrinsic information, in addition to the transition probability matrix, a prior distribution
of past states must be specified. Although the transition probability matrix is determined by the intrinsic
mechanisms of a system, a prior distribution of past states cannot be uniquely determined. There are
many possible ways to choose a prior distribution according to different criteria. For example, in the
context of channel capacity in information theory, the prior distribution that maximizes information may
be selected [16]. In contrast, IIT selects the maximum entropy distribution as a prior distribution.
If a system’s states are represented as a set of discrete variables, the maximum entropy distribution is the
uniform distribution over all possible past states $X^{t-\tau}$. Thus, using the maximum entropy distribution
as a prior distribution means that every possible past state is equally likely as a cause of a present state.

Although the maximum entropy distribution can be uniquely defined for discrete variables, this is not
possible for continuous variables [15,16]. If some constraints are given, the maximum entropy distribution
can be defined for continuous variables. For example, under the constraints that the mean and the
variance of the variables are fixed at specific values, the Gaussian distribution with the specified mean
and variance is the maximum entropy distribution. However, there is no principle that determines what
types of constraints should be imposed and how the maximum entropy distribution should be uniquely
determined for continuous variables. Thus, intrinsic information (and integrated information) as defined in
IIT 2.0 can be applied only to discrete variables.

Using entropy, Eq. [22] can be written as

$$\mathrm{ei}(x^t) = H\left(p({}^{\max}X^{t-\tau})\right) - H\left(p({}^{\max}X^{t-\tau} \mid x^t)\right), \tag{24}$$

where the superscript max placed on the left side of $X^{t-\tau}$ is a reminder that the distribution of
$X^{t-\tau}$ is the maximum entropy distribution. Eq. [24] provides another interpretation of effective
information. It quantifies the extent to which uncertainty about the past states $X^{t-\tau}$ (the entropy
$H({}^{\max}X^{t-\tau})$) can be reduced by knowing a particular present state $x^t$ from the system’s
intrinsic point of view. Using Bayes’ rule, the posterior distribution, $p({}^{\max}X^{t-\tau} \mid x^t)$, can be
calculated as

$$p({}^{\max}X^{t-\tau} \mid x^t) = \frac{p(x^t \mid {}^{\max}X^{t-\tau})\, p({}^{\max}X^{t-\tau})}{p(x^t)}. \tag{25}$$

Averaging $\mathrm{ei}(x^t)$ over all possible present states $x^t$, the averaged effective information equals the
mutual information between past states and present states,

$$\mathrm{EI} = \sum_{x^t} p(x^t)\, \mathrm{ei}(x^t), \tag{26}$$

$$\phantom{\mathrm{EI}} = H\left(p({}^{\max}X^{t-\tau})\right) - H\left(p({}^{\max}X^{t-\tau} \mid X^t)\right), \tag{27}$$

$$\phantom{\mathrm{EI}} = I({}^{\max}X^{t-\tau}; X^t). \tag{28}$$

While effective information is originally quantified in a state-dependent manner as in Eq. [22] (with a
particular present state, $x^t$), we consider only the averaged effective information in Eq. [28] (with the
entire set of present states, $X^t$), following the previous study [15].
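The identity in Eqs. 26–28 can be checked numerically. The sketch below assumes a uniform prior over past states and a hypothetical noisy two-state transition matrix; it computes the average of the state-dependent effective information and, separately, the mutual information between past and present states.

```python
import math

def averaged_ei_and_mi(tpm):
    """Return (EI, I) in bits for tpm[past][present] under a uniform prior."""
    n_past, n_present = len(tpm), len(tpm[0])
    prior = [1.0 / n_past] * n_past
    p_present = [sum(prior[a] * tpm[a][b] for a in range(n_past))
                 for b in range(n_present)]

    # Left-hand side: average of ei(x^t) = D_KL(posterior || prior), Eq. 26.
    ei_avg = 0.0
    for b in range(n_present):
        if p_present[b] == 0:
            continue
        posterior = [prior[a] * tpm[a][b] / p_present[b] for a in range(n_past)]
        ei = sum(po * math.log2(po / pr)
                 for po, pr in zip(posterior, prior) if po > 0)
        ei_avg += p_present[b] * ei

    # Right-hand side: mutual information between past and present, Eq. 28.
    mi = sum(prior[a] * tpm[a][b] * math.log2(tpm[a][b] / p_present[b])
             for a in range(n_past) for b in range(n_present) if tpm[a][b] > 0)
    return ei_avg, mi

# Hypothetical noisy two-state system.
ei_avg, mi = averaged_ei_and_mi([[0.9, 0.1], [0.2, 0.8]])
```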


Integrated information 

Integrated information is the quantity that measures the information generated by the system as a
whole above and beyond the information generated independently by its parts. As in the computation of
information, integrated information is computed between the system’s past states $X^{t-\tau}$ and present
states $X^t$. Consider partitioning a system into $m$ parts, $M_1, M_2, \ldots, M_m$, and computing the
amount of information that is integrated across the $m$ parts. Quantifying integrated information is
equivalent to quantifying the amount of information lost by partitioning the system. In IIT, partitioning
into $m$ parts corresponds to splitting the transition probability matrix $p(X^t \mid X^{t-\tau})$ into the
product of the transition probability matrices of the parts, $p(M_i^t \mid M_i^{t-\tau})$. The partitioned
transition probability matrix, $q(X^t \mid X^{t-\tau})$, can be written as

$$q(X^t \mid X^{t-\tau}) = \prod_{i=1}^{m} p(M_i^t \mid M_i^{t-\tau}). \tag{29}$$

Integrated information, $\phi(x^t)$, proposed in IIT 2.0 is defined as the difference between the posterior
probability distribution of past states given a present state in the intact system,
$p({}^{\max}X^{t-\tau} \mid x^t)$, and that in the “partitioned” system, $q({}^{\max}X^{t-\tau} \mid x^t)$, as follows:

$$\phi(x^t) = D_{KL}\left(p({}^{\max}X^{t-\tau} \mid x^t)\,\|\,q({}^{\max}X^{t-\tau} \mid x^t)\right), \tag{30}$$

where $D_{KL}$ is the Kullback-Leibler divergence defined in Eq. [23] and past states are assumed to follow the
maximum entropy distribution. $q({}^{\max}X^{t-\tau} \mid x^t)$ is defined as follows:

$$q({}^{\max}X^{t-\tau} \mid x^t) = \frac{q(x^t \mid {}^{\max}X^{t-\tau})\, q({}^{\max}X^{t-\tau})}{q(x^t)}, \tag{31}$$

where $q(x^t) = \sum_{X^{t-\tau}} q(x^t \mid X^{t-\tau})\, q({}^{\max}X^{t-\tau})$ and $q({}^{\max}X^{t-\tau})$ is the
maximum entropy distribution. Integrated information defined in Eq. [30] quantifies how much the posterior
probability distribution of past states given a present state changes if the parts of the system are forced
to be independent.

Although the original integrated information measure $\phi(x^t)$ is defined for a particular present state
$x^t$, we consider only the average of $\phi(x^t)$ over all possible present states, as was done for quantifying
information in the previous section. The averaged integrated information $\Phi$ can be calculated as follows:

$$\Phi = \sum_{x^t} p(x^t)\, \phi(x^t), \tag{32}$$

$$\phantom{\Phi} = \sum_{x^t} p(x^t)\, D_{KL}\left(p({}^{\max}X^{t-\tau} \mid x^t)\,\|\,q({}^{\max}X^{t-\tau} \mid x^t)\right), \tag{33}$$

$$\phantom{\Phi} = \sum_{x^t} p(x^t) \sum_{x^{t-\tau}} p({}^{\max}x^{t-\tau} \mid x^t) \log \frac{p({}^{\max}x^{t-\tau} \mid x^t)}{q({}^{\max}x^{t-\tau} \mid x^t)}. \tag{34}$$

Using Eqs. [29] and [31], we can write $\Phi$ in terms of entropy as follows:

$$\Phi = \sum_{i=1}^{m} H({}^{\max}M_i^{t-\tau} \mid M_i^t) - H({}^{\max}X^{t-\tau} \mid X^t). \tag{35}$$

As shown in Eq. [35], integrated information measures the difference between the uncertainty of past
states given present states in the intact system and that in the partitioned system. The uncertainty of
the partitioned system is never smaller than that of the intact system, and the increase in uncertainty
corresponds to the loss of information caused by partitioning. We can rewrite Eq. [35] in terms of mutual
information as follows:

$$\Phi = I({}^{\max}X^{t-\tau}; X^t) - \sum_{i=1}^{m} I({}^{\max}M_i^{t-\tau}; M_i^t), \tag{36}$$

where we use the fact that the entropy of the whole system $H({}^{\max}X^{t-\tau})$ is the same as the sum of the
entropies of the subsystems $H({}^{\max}M_i^{t-\tau})$ when the maximum entropy distribution is assumed.
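Eq. 36 can be illustrated with a toy system of two binary units. The sketch below is a minimal example under stated assumptions: the update rules are deterministic and hypothetical, the prior over past states is uniform, and each part's transition probabilities are obtained by averaging over the other unit's past state under that prior. The function names are illustrative.

```python
import math
from itertools import product

STATES = list(product([0, 1], repeat=2))  # joint states (m1, m2)

def mutual_info(prior, cond):
    """I(past; present) in bits for prior[a] and cond[a][b] = p(b | a)."""
    p_b = {}
    for a, pa in prior.items():
        for b, pba in cond[a].items():
            p_b[b] = p_b.get(b, 0.0) + pa * pba
    return sum(pa * pba * math.log2(pba / p_b[b])
               for a, pa in prior.items()
               for b, pba in cond[a].items() if pba > 0)

def phi(update):
    """Phi = I(whole) - sum_i I(part i) for a deterministic update rule."""
    prior = {s: 0.25 for s in STATES}
    whole = {s: {update(s): 1.0} for s in STATES}
    total = mutual_info(prior, whole)
    parts = 0.0
    for i in (0, 1):
        part_prior = {0: 0.5, 1: 0.5}
        part_cond = {0: {}, 1: {}}
        for s in STATES:
            nxt = update(s)[i]
            # The other unit's past state is averaged out (probability 0.5).
            part_cond[s[i]][nxt] = part_cond[s[i]].get(nxt, 0.0) + 0.5
        parts += mutual_info(part_prior, part_cond)
    return total - parts

# Units that swap states: all information is carried by the interaction.
phi_swap = phi(lambda s: (s[1], s[0]))
# Units that copy their own past: the parts are independent.
phi_copy = phi(lambda s: (s[0], s[1]))
```

In the swap system each unit alone carries no information about its own past, so all 2 bits of the whole are lost by partitioning; in the copy system nothing is lost.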

Quantitative meaning of I and I* in information theory 

In this section, we briefly review the quantitative meaning of mutual information $I$ in information theory
and that of its extension to mismatched decoding, $I^*$, which was developed by Merhav et al. [17].
Consider information transmission over a noisy channel $p(Y \mid X)$, where $X$ is the input
and $Y$ is the output. For simplicity, assume that $X$ and $Y$ are both 0 or 1. (In the Results section, we
consider the case where $X$ and $Y$ are the past and present states of a system, $X^{t-\tau}$ and $X^t$,
respectively, and the states of a system are multidimensional variables, but the arguments described below
apply generally to such a case.) The sender transmits a sequence of $X$ with length $N$, called a code
word, $c = [X_1, X_2, \ldots, X_N]$, over the noisy channel. For binary inputs, there are $2^N$ possible code
words, but the sender does not transmit them all. The set of code words transmitted over the noisy channel is
called a codebook. The codebook is shared between the sender and the receiver. The transmitted code
word is disturbed by noise that depends on $p(Y \mid X)$ and is changed to $c' = [Y_1, Y_2, \ldots, Y_N]$, where
$Y_i$ is the output of $X_i$. The job of the receiver is to infer (decode) which code word was sent from the
received message $c'$. Consider the following question: for the receiver to decode the message “error-free”
(more precisely, with an infinitesimally small error in the limit $N \to \infty$), how many code words can
the sender transmit, or how many code words can the codebook contain?

Shannon’s noisy channel coding theorem answers this question. According to the noisy channel coding
theorem, the mutual information determines the upper limit of the number of code words that can be
sent error-free over a noisy channel. We denote the maximal number of code words that can be sent
error-free over the noisy channel by $2^{RN}$, where $R$ is called the information transfer rate and is less
than or equal to 1. The information transfer rate $R$ is given by the mutual information $I$ between $X$ and $Y$,

$$R = I(X;Y). \tag{37}$$
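To give a feel for Eq. 37, the following sketch evaluates the mutual information of a binary symmetric channel. This specific channel, the uniform input distribution, and the flip probability of 0.1 are assumptions chosen for illustration only.

```python
import math

def binary_entropy(f):
    """Entropy (in bits) of a binary variable with P(1) = f."""
    if f in (0.0, 1.0):
        return 0.0
    return -f * math.log2(f) - (1.0 - f) * math.log2(1.0 - f)

def bsc_rate(flip_prob):
    """I(X;Y) in bits per channel use for a binary symmetric channel
    with uniform input, where each bit is flipped with `flip_prob`."""
    return 1.0 - binary_entropy(flip_prob)

rate = bsc_rate(0.1)
# With code words of length N, about 2**(rate * N) of them can be decoded
# reliably; here we report the exponent rate * N for N = 1000.
codebook_bits = rate * 1000
```

A noiseless channel gives a rate of 1 bit per use, and a channel that flips bits half of the time gives a rate of 0.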

To achieve the maximal information transfer rate given by the mutual information, the receiver must
decode a message optimally, which can be done using maximum likelihood estimation. Maximum
likelihood estimation means choosing the code word $c$ in the codebook that maximizes the
likelihood $p(c' \mid c)$,

$$p(c' \mid c) = \prod_{i} p(Y_i \mid X_i). \tag{38}$$

Note that the optimal decoding scheme uses the actual probability distribution $p(Y \mid X)$. This type of
decoding is called matched decoding, because the probability distribution used for decoding is matched
with the actual probability distribution. If a mismatched probability distribution $q(Y \mid X)$, which is
different from the actual probability distribution $p(Y \mid X)$, is used for decoding instead, the information
transfer rate necessarily degrades. The information transfer rate $R^*$ for mismatched decoding is given
by

$$R^* = I^*(X;Y). \tag{39}$$

As in matched decoding, decoding is performed using maximum likelihood estimation, but with the
following “mismatched” likelihood function $q(c' \mid c)$:

$$q(c' \mid c) = \prod_{i} q(Y_i \mid X_i). \tag{40}$$

$I^*(X;Y)$ is an extension of the mutual information $I(X;Y)$ in the sense that it gives the information
transfer rate over a noisy channel $p(Y \mid X)$ when a mismatched distribution $q(Y \mid X)$ is used for decoding.
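The difference between matched and mismatched decoding in Eqs. 38 and 40 can be sketched with a toy decoder. The channel tables, the two-word codebook, and the received message below are hypothetical values chosen so that the two decoders disagree; `ml_decode` is an illustrative name.

```python
import math

def ml_decode(received, codebook, channel):
    """Return the code word c maximizing prod_i channel[x_i][y_i],
    where channel[x][y] is the (assumed) probability of output y given input x."""
    def log_likelihood(word):
        return sum(math.log(channel[x][y]) for x, y in zip(word, received))
    return max(codebook, key=log_likelihood)

# Actual channel p(Y|X): input 0 is almost never corrupted, input 1 often is.
p_channel = [[0.99, 0.01],
             [0.40, 0.60]]
# Mismatched model q(Y|X): the decoder wrongly assumes symmetric noise.
q_channel = [[0.90, 0.10],
             [0.10, 0.90]]

codebook = [(0, 0, 0, 0), (1, 1, 1, 1)]
received = (0, 0, 0, 1)

matched = ml_decode(received, codebook, p_channel)     # uses p(Y|X)
mismatched = ml_decode(received, codebook, q_channel)  # uses q(Y|X)
```

Under the actual channel, a single received 1 is strong evidence that a 1 was sent, so the matched decoder chooses the all-ones word, whereas the decoder with the symmetric model chooses the all-zeros word.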

The information transfer rate determines the amount of information that can be obtained from a
message. The receiver obtains more information from a message when the information transfer rate
increases. The mutual information $I$, which is equivalent to the maximal information transfer rate,
determines the maximum amount of information that can be obtained by matched decoding. $I^*$, in
contrast, determines the amount of information that can be obtained by mismatched decoding.

Mathematical expression of I* 

The amount of information for mismatched decoding can be evaluated using the following equation,

$$I^*(X^{t-\tau}; X^t) = -\sum_{X^t} p(X^t) \log \sum_{X^{t-\tau}} p(X^{t-\tau})\, q(X^t \mid X^{t-\tau})^{\beta} + \sum_{X^{t-\tau}, X^t} p(X^{t-\tau}, X^t) \log q(X^t \mid X^{t-\tau})^{\beta}, \tag{41}$$

where $\beta$ is the value that maximizes $I^*$. The maximization of $I^*$ with respect to $\beta$ is performed by
differentiating $I^*$ and solving the equation $dI^*(\beta)/d\beta = 0$. In general, the solution of the equation
can be found using the standard gradient ascent method, because $I^*$ is a concave function of $\beta$.

For comparison, the mutual information is given by

$$I(X^{t-\tau}; X^t) = -\sum_{X^t} p(X^t) \log p(X^t) + \sum_{X^{t-\tau}, X^t} p(X^{t-\tau}, X^t) \log p(X^t \mid X^{t-\tau}). \tag{42}$$

If the mismatched probability distribution $q(X^t \mid X^{t-\tau})$ is replaced by the actual distribution
$p(X^t \mid X^{t-\tau})$ in Eq. [41], the derivative of $I^*$ becomes 0 when $\beta = 1$. By substituting $q = p$
and $\beta = 1$ into Eq. [41], one can check that $I^*$ is equal to $I$ in Eq. [42], as it should be. The amount
of information for mismatched decoding, $I^*$, was first derived in the field of information theory as an
extension of the mutual information to the case of mismatched decoding [17]. $I^*$ was first introduced into
neuroscience in [18] and was first applied to the analysis of neural data by [19]. However, $I^*$ in the prior
neuroscience applications [18,19] was quantified between stimuli and neural states, not between past and
present states of a system as in the present study.
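For discrete states, Eq. 41 and the maximization over $\beta$ can be sketched as follows. This is an illustration under stated assumptions, not the procedure used in the study: the transition matrices are hypothetical, $q$ is assumed strictly positive, the maximizer is assumed to lie in the interval $[0, 10]$, and a ternary search (which relies on $I^*$ having a single maximum in $\beta$) is swapped in for gradient ascent. Values are in nats.

```python
import math

def i_star(beta, prior, p, q):
    """Evaluate I*(beta) of Eq. 41 (in nats) for discrete states.

    prior[a] = p(past a); p[a][b] and q[a][b] are the actual and mismatched
    probabilities of present state b given past state a (q must be > 0).
    """
    pasts, presents = range(len(p)), range(len(p[0]))
    p_present = [sum(prior[a] * p[a][b] for a in pasts) for b in presents]
    first = -sum(
        p_present[b] * math.log(sum(prior[a] * q[a][b] ** beta for a in pasts))
        for b in presents if p_present[b] > 0
    )
    second = sum(
        prior[a] * p[a][b] * beta * math.log(q[a][b])
        for a in pasts for b in presents if p[a][b] > 0
    )
    return first + second

def maximize_i_star(prior, p, q, lo=0.0, hi=10.0, iterations=200):
    """Ternary search for the beta in [lo, hi] that maximizes I*(beta)."""
    for _ in range(iterations):
        left = lo + (hi - lo) / 3.0
        right = hi - (hi - lo) / 3.0
        if i_star(left, prior, p, q) < i_star(right, prior, p, q):
            lo = left
        else:
            hi = right
    beta = 0.5 * (lo + hi)
    return beta, i_star(beta, prior, p, q)

prior = [0.5, 0.5]
p = [[0.9, 0.1], [0.2, 0.8]]  # actual transition probabilities
q = [[0.6, 0.4], [0.4, 0.6]]  # hypothetical mismatched model

beta_matched, info_matched = maximize_i_star(prior, p, p)
beta_mismatched, info_mismatched = maximize_i_star(prior, p, q)
```

With $q = p$ the search returns $\beta \approx 1$ and $I^*$ equals the mutual information of Eq. 42; with the mismatched model, $I^*$ is smaller.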

Analytical computation of $\Phi^*$ under the Gaussian assumption

Assume that the probability distribution of neural states $x$ is the Gaussian distribution,

$$p(x) = \frac{1}{\left((2\pi)^N |\Sigma(X)|\right)^{1/2}} \exp\left(-\frac{1}{2}(x - \bar{x})^{T} \Sigma(X)^{-1} (x - \bar{x})\right), \tag{43}$$

where $N$ is the number of variables in $x$, $\bar{x}$ is the mean value of $x$, and $\Sigma(X)$ is the covariance
matrix of $x$. The Gaussian assumption allows us to compute $\Phi^*$ analytically, which substantially reduces
the cost of computing $\Phi^*$. When $X^{t-\tau}$ and $X^t$ are both multivariate Gaussian variables, the mutual
information between $X^{t-\tau}$ and $X^t$, $I(X^{t-\tau}; X^t)$, can be analytically computed as

$$I(X^{t-\tau}; X^t) = \frac{1}{2} \log \frac{|\Sigma(X^{t-\tau})|}{|\Sigma(X^{t-\tau} \mid X^t)|}, \tag{44}$$

where $\Sigma(X^{t-\tau} \mid X^t)$ is the covariance matrix of the conditional distribution,
$p(X^{t-\tau} \mid X^t)$, which is expressed as

$$\Sigma(X^{t-\tau} \mid X^t) = \Sigma(X^{t-\tau}) - \Sigma(X^{t-\tau}, X^t)\, \Sigma(X^t)^{-1}\, \Sigma(X^{t-\tau}, X^t)^{T}, \tag{45}$$

where $\Sigma(X^{t-\tau}, X^t)$ is the cross-covariance matrix between $X^{t-\tau}$ and $X^t$, whose element
$\Sigma(X^{t-\tau}, X^t)_{ij}$ is given by $\mathrm{cov}(X_i^{t-\tau}, X_j^t)$.
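Eqs. 44 and 45 can be checked on a case with a known answer. The sketch below assumes a hypothetical system of two independent first-order autoregressive units, each with coefficient 0.5 and unit-variance noise, for which the mutual information is $-\log(1 - 0.5^2)$ nats in total; the small 2×2 matrix helpers are written out only to keep the example free of external libraries.

```python
import math

def det2(m):
    return m[0][0] * m[1][1] - m[0][1] * m[1][0]

def inv2(m):
    d = det2(m)
    return [[m[1][1] / d, -m[0][1] / d], [-m[1][0] / d, m[0][0] / d]]

def matmul(a, b):
    return [[sum(a[i][k] * b[k][j] for k in range(2)) for j in range(2)]
            for i in range(2)]

def transpose(m):
    return [list(row) for row in zip(*m)]

def gaussian_mi(cov_past, cov_present, cross):
    """Eqs. 44-45 (in nats); cross[i][j] = cov(X_i^{t-tau}, X_j^t)."""
    explained = matmul(matmul(cross, inv2(cov_present)), transpose(cross))
    cond = [[cov_past[i][j] - explained[i][j] for j in range(2)]
            for i in range(2)]
    return 0.5 * math.log(det2(cov_past) / det2(cond))

# Two independent AR(1) units: x_t = a * x_{t-1} + unit-variance noise.
a = 0.5
var = 1.0 / (1.0 - a ** 2)         # stationary variance of each unit
cov = [[var, 0.0], [0.0, var]]     # Sigma(X^{t-tau}) = Sigma(X^t)
cross = [[a * var, 0.0], [0.0, a * var]]
mi = gaussian_mi(cov, cov, cross)  # equals log(4/3) nats
```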

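Similarly, we can obtain the analytical expression of $I^*$ as follows:

$$I^*(\beta) = \frac{1}{2} \mathrm{Tr}\left(\Sigma(X^t) R\right) + \frac{1}{2} \log\left(|Q|\,|\Sigma(X^{t-\tau})|\right) - \frac{\beta N}{2}, \tag{46}$$

where Tr stands for trace. $Q$ and $R$ are given by

$$Q = \Sigma(X^{t-\tau})^{-1} + \beta\, \Sigma_D(X^{t-\tau})^{-1} \Sigma_D(X^t, X^{t-\tau})^{T} \Sigma_D(X^t \mid X^{t-\tau})^{-1} \Sigma_D(X^t, X^{t-\tau}) \Sigma_D(X^{t-\tau})^{-1}, \tag{47}$$

$$R = \beta\, \Sigma_D(X^t \mid X^{t-\tau})^{-1} - \beta^2\, \Sigma_D(X^t \mid X^{t-\tau})^{-1T} \Sigma_D(X^t, X^{t-\tau}) \Sigma_D(X^{t-\tau})^{-1} Q^{-1} \Sigma_D(X^{t-\tau})^{-1} \Sigma_D(X^t, X^{t-\tau})^{T} \Sigma_D(X^t \mid X^{t-\tau})^{-1}, \tag{48}$$

where $\Sigma_D(X^{t-\tau})$, $\Sigma_D(X^t, X^{t-\tau})$, and $\Sigma_D(X^t \mid X^{t-\tau})$ are block-diagonal matrices.
Each block is the corresponding covariance matrix of one part, $\Sigma(M_i^{t-\tau})$, $\Sigma(M_i^t, M_i^{t-\tau})$,
and $\Sigma(M_i^t \mid M_i^{t-\tau})$, where $M_i$ is a subsystem. For example, $\Sigma_D(X^{t-\tau})$ is given by

$$\Sigma_D(X^{t-\tau}) = \begin{pmatrix} \Sigma(M_1^{t-\tau}) & & & 0 \\ & \Sigma(M_2^{t-\tau}) & & \\ & & \ddots & \\ 0 & & & \Sigma(M_m^{t-\tau}) \end{pmatrix}. \tag{49}$$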


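The maximization of $I^*$ with respect to $\beta$ is performed by solving the equation $dI^*(\beta)/d\beta = 0$. The
derivative of $I^*(\beta)$ with respect to $\beta$ is given by

$$\frac{dI^*(\beta)}{d\beta} = \frac{1}{2} \mathrm{Tr}\left(\Sigma(X^t) \frac{dR}{d\beta}\right) + \frac{1}{2} \mathrm{Tr}\left(Q^{-1} \frac{dQ}{d\beta}\right) - \frac{N}{2}, \tag{50}$$

where

$$\begin{aligned} \frac{dR}{d\beta} = {} & \Sigma_D(X^t \mid X^{t-\tau})^{-1} \\ & - 2\beta\, \Sigma_D(X^t \mid X^{t-\tau})^{-1T} \Sigma_D(X^t, X^{t-\tau}) \Sigma_D(X^{t-\tau})^{-1} Q^{-1} \Sigma_D(X^{t-\tau})^{-1} \Sigma_D(X^t, X^{t-\tau})^{T} \Sigma_D(X^t \mid X^{t-\tau})^{-1} \\ & - \beta^2\, \Sigma_D(X^t \mid X^{t-\tau})^{-1T} \Sigma_D(X^t, X^{t-\tau}) \Sigma_D(X^{t-\tau})^{-1} \frac{dQ^{-1}}{d\beta} \Sigma_D(X^{t-\tau})^{-1} \Sigma_D(X^t, X^{t-\tau})^{T} \Sigma_D(X^t \mid X^{t-\tau})^{-1}, \end{aligned} \tag{51}$$

and

$$\frac{dQ}{d\beta} = \Sigma_D(X^{t-\tau})^{-1} \Sigma_D(X^t, X^{t-\tau})^{T} \Sigma_D(X^t \mid X^{t-\tau})^{-1} \Sigma_D(X^t, X^{t-\tau}) \Sigma_D(X^{t-\tau})^{-1}, \tag{52}$$

$$\frac{dQ^{-1}}{d\beta} = -Q^{-1} \frac{dQ}{d\beta} Q^{-1}, \tag{53}$$

$$\phantom{\frac{dQ^{-1}}{d\beta}} = -Q^{-1} \Sigma_D(X^{t-\tau})^{-1} \Sigma_D(X^t, X^{t-\tau})^{T} \Sigma_D(X^t \mid X^{t-\tau})^{-1} \Sigma_D(X^t, X^{t-\tau}) \Sigma_D(X^{t-\tau})^{-1} Q^{-1}. \tag{54}$$

Inspection of the above equations reveals that $dI^*(\beta)/d\beta = 0$ is a quadratic equation with respect to
$\beta$. Thus, $\beta$ can be computed analytically without resorting to numerical optimization such as gradient
ascent.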


Acknowledgments 

M.O. was supported by a Grant-in-Aid for Young Scientists (B) from the Ministry of Education, Culture, 
Sports, Science, and Technology of Japan (26870860). N.T. was supported by Precursory Research for 
Embryonic Science and Technology from Japan Science and Technology Agency (3630), Future Fellowship 
(FT120100619) and Discovery Project (DP130100194) from Australian Research Council. 


References 

1. Chalmers DJ (1995) Facing up to the problem of consciousness. Journal of consciousness studies 
2, 200-219. 

2. Tononi G (2004) An information integration theory of consciousness. BMC Neurosci 5, 42. 

3. Tononi G (2008) Consciousness as integrated information: a provisional manifesto. Biol Bull, 215, 
216-242. 

4. Tononi G (2010) Information integration: its relevance to brain function and consciousness. Arch 
Ital Biol 148, 299-322. 

5. Tononi G (2012) Integrated information theory of consciousness: an updated account. Arch Ital 
Biol 150, 56-90. 

6. Balduzzi D, Tononi G (2008) Integrated information in discrete dynamical systems: Motivation
and theoretical framework. PLoS Comput Biol 4, e1000091.

7. Balduzzi D, Tononi G (2009) Qualia: the geometry of integrated information. PLoS Comput Biol
5, e1000462.

8. Oizumi M, Albantakis L, Tononi G (2014) From the Phenomenology to the Mechanisms of
Consciousness: Integrated Information Theory 3.0. PLoS Comput Biol 10, e1003588.

9. Tononi G, Koch C (2015) Consciousness: here, there and everywhere? Phil Trans R Soc B 19, 
370. 

10. Massimini M, Ferrarelli F, Huber R, Esser SK, Singh H, Tononi G (2005) Breakdown of cortical 
effective connectivity during sleep. Science 309, 2228-32. 





11. Massimini M, Ferrarelli F, Esser SK, Riedner BA, Huber R, Murphy M, Peterson MJ, Tononi G 
(2007) Triggering sleep slow waves by transcranial magnetic stimulation. Proc Natl Acad Sci USA 
104, 8496-501. 

12. Ferrarelli F, Massimini M, Sarasso S, Casali A, Riedner BA, Angelini G, Tononi G, Pearce RA 
(2010) Breakdown in cortical effective connectivity during midazolam-induced loss of consciousness. 
Proc Natl Acad Sci USA 107, 2681-2686. 

13. Rosanova M, Gosseries O, Casarotto S, Boly M, Casali AG, Bruno MA, Mariotti M, Boveroux P, 
Tononi G, Laureys S, Massimini M (2012) Recovery of cortical effective connectivity and recovery 
of consciousness in vegetative patients. Brain 135, 1308-20. 

14. Casali AG, Gosseries O, Rosanova M, Boly M, Sarasso S, et al. (2013) A theoretically based index
of consciousness independent of sensory processing and behavior. Science Translational Medicine 5
(198): 198ra105.

15. Barrett AB, Seth AK (2011) Practical measures of integrated information for time-series data.
PLoS Comput Biol 7, e1001052.

16. Cover TM, Thomas JA (1991) Elements of information theory. New York: Wiley. 

17. Merhav N, Kaplan G, Lapidoth A, Shamai (Shitz) S (1994) On information rates for mismatched
decoders. IEEE Trans Inform Theory 40, 1953-1967.

18. Latham PE, Nirenberg S (2005) Synergy, redundancy, and independence in population codes, 
revisited. J Neurosci 25, 5195-5206. 

19. Oizumi M, Ishii T, Ishibashi K, Hosoya T, Okada M (2010) Mismatched decoding in the brain. J 
Neurosci 30, 4815-4826. 

20. Ay N (2001) Information geometry on complexity and stochastic interaction. MPI MIS Preprint 
95. Available: http://www.mis.mpg.de/publications/preprints/2001/prepr2001-95.html 

21. Lee U, Mashour GA, Kim S, Noh GJ, Choi BM (2009) Propofol induction reduces the capacity
for neural information integration: Implications for the mechanism of consciousness and general
anesthesia. Conscious Cogn 18, 56-64.

22. Chang JY, et al. (2012) Multivariate autoregressive models with exogenous inputs for intracerebral 
responses to direct electrical stimulation of the human brain. Front Hum Neurosci 6, 317. 

23. Alkire MT, Hudetz AG, Tononi G. (2008) Consciousness and anesthesia. Science 322, 876-80. 

24. Boly M (2011) Measuring the fading consciousness in the human brain. Curr Opin Neurol 24, 
394-400. 

25. Sanders RD, Tononi G, Laureys S, Sleigh J (2012) Unresponsiveness ≠ unconsciousness.
Anesthesiology 116, 1-1.

26. Ding M, Chen Y, Bressler SL (2006) Granger causality: Basic theory and application to
neuroscience. In Schelter S, Winterhalder N, & Timmer J. Handbook of Time Series Analysis. Wiley,
Weinheim.

27. Vicente R, Wibral M, Lindner M, Pipa G (2011) Transfer entropy-a model-free measure of effective 
connectivity for the neurosciences. J Comput Neurosci 30, 45-67. 





28. Seth AK, Barrett AB, Barnett L (2011) Causal density and integrated information as measures of 
conscious level. Philos Transact A Math Phys Eng Sci 369, 3748-3767. 

29. Rieke F, Warland D, de Ruyter van Steveninck R, Bialek W (1997) Spikes: exploring the neural 
code. (MIT Press, Cambridge, MA). 

30. Dayan P, Abbott LF (2001) Theoretical Neuroscience. Computational and Mathematical Modeling 
of Neural Systems. (MIT Press, Cambridge, MA). 

31. Averbeck BB, Latham PE, Pouget A (2006) Neural correlations, population coding and
computation. Nat Rev Neurosci 7, 358-366.

32. Quian Quiroga R, Panzeri S (2009) Extracting information from neuronal populations: information 
theory and decoding approaches. Nat Rev Neurosci 10, 173-185. 

33. Jaynes ET (1957) Information theory and statistical mechanics. Phys Rev 106, 620-630. 



arXiv:1505.04368v1 [q-bio.NC] 17 May 2015


Measuring integrated information from the decoding perspective

Masafumi Oizumi 1,2,*, Shun-ichi Amari 1, Toru Yanagawa 1, Naotaka Fujii 1, Naotsugu Tsuchiya 2,3,*

1 RIKEN Brain Science Institute, 2-1 Hirosawa, Wako, Saitama 351-0198, Japan 

2 Monash University, Clayton Campus, Victoria 3800, Australia 

3 Japan Science and Technology Agency, Japan 

* E-mail: oizumi@brain.riken.jp, naotsu@gmail.com 


Supporting Information 


Equivalence of $\Phi^*$ with $\Phi$ under the assumption of maximum entropy distribution 

We show that $\Phi^*$, proposed in the present paper, is equivalent to $\Phi$, proposed by [T], when the maximum entropy distribution is assumed for the past state. 

First, note that when the maximum entropy distribution is assumed, the distribution of past states in the whole system decomposes into the product of the distributions of the parts: 

\begin{equation}
p({}^{\max}X^{t-\tau}) = \prod_i p({}^{\max}M_i^{t-\tau}). \tag{1}
\end{equation}


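Bearing this in mind, we can compute $I^*(\beta)$ as follows: 

\begin{align}
I^*(\beta) &= -\int dX^t\, p(X^t) \log \prod_i \int dM_i^{t-\tau}\, p({}^{\max}M_i^{t-\tau})\, p(M_i^t|M_i^{t-\tau})^{\beta} \notag \\
&\quad + \beta \int dX^{t-\tau} \int dX^t\, p({}^{\max}X^{t-\tau})\, p(X^t|X^{t-\tau}) \log \prod_i p(M_i^t|M_i^{t-\tau}), \tag{2} \\
&= -\sum_i \int dM_i^t\, p(M_i^t) \log \int dM_i^{t-\tau}\, p({}^{\max}M_i^{t-\tau})\, p(M_i^t|M_i^{t-\tau})^{\beta} - \beta \sum_i H(M_i^t|M_i^{t-\tau}). \tag{3}
\end{align}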


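To obtain the $\beta$ that maximizes $I^*(\beta)$, we differentiate $I^*(\beta)$ as 

\begin{equation}
\frac{dI^*(\beta)}{d\beta} = -\sum_i \left( \int dM_i^t\, \frac{p(M_i^t)}{r(M_i^t)} \frac{dr(M_i^t)}{d\beta} + H(M_i^t|M_i^{t-\tau}) \right), \tag{4}
\end{equation}

where 

\begin{align}
r(M_i^t) &= \int dM_i^{t-\tau}\, p({}^{\max}M_i^{t-\tau})\, p(M_i^t|M_i^{t-\tau})^{\beta}, \tag{5} \\
\frac{dr(M_i^t)}{d\beta} &= \int dM_i^{t-\tau}\, p({}^{\max}M_i^{t-\tau})\, p(M_i^t|M_i^{t-\tau})^{\beta} \log p(M_i^t|M_i^{t-\tau}). \tag{6}
\end{align}

Substituting $\beta = 1$ into $dI^*(\beta)/d\beta$, and noting that $r(M_i^t) = p(M_i^t)$ at $\beta = 1$, we obtain 

\begin{align}
\frac{dI^*}{d\beta}(\beta = 1) &= -\sum_i \left( \int dM_i^t \int dM_i^{t-\tau}\, p({}^{\max}M_i^{t-\tau})\, p(M_i^t|M_i^{t-\tau}) \log p(M_i^t|M_i^{t-\tau}) + H(M_i^t|M_i^{t-\tau}) \right), \tag{7} \\
&= 0. \tag{8}
\end{align}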
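That this stationary point is a maximum rather than a minimum follows from the concavity of $I^*(\beta)$ in $\beta$. The step can be sketched as follows, with $r(M_i^t) = \int dM_i^{t-\tau}\, p({}^{\max}M_i^{t-\tau})\, p(M_i^t|M_i^{t-\tau})^{\beta}$ as in Eq. 5. The normalized weight $w_\beta$ is a symbol introduced here for convenience and is not part of the original derivation.

```latex
\begin{align}
w_\beta(M_i^{t-\tau}|M_i^t) &\equiv \frac{p({}^{\max}M_i^{t-\tau})\, p(M_i^t|M_i^{t-\tau})^{\beta}}{r(M_i^t)}, \notag \\
\frac{d^2}{d\beta^2} \log r(M_i^t) &= \mathrm{E}_{w_\beta}\!\left[ \left( \log p(M_i^t|M_i^{t-\tau}) \right)^2 \right] - \left( \mathrm{E}_{w_\beta}\!\left[ \log p(M_i^t|M_i^{t-\tau}) \right] \right)^2 \geq 0, \notag \\
\frac{d^2 I^*(\beta)}{d\beta^2} &= -\sum_i \int dM_i^t\, p(M_i^t)\, \frac{d^2}{d\beta^2} \log r(M_i^t) \leq 0. \notag
\end{align}
```

The second line is the variance of $\log p(M_i^t|M_i^{t-\tau})$ under $w_\beta$, so it cannot be negative. The term of $I^*(\beta)$ that is linear in $\beta$ does not contribute to the second derivative.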
We therefore find that $I^*(\beta)$ is maximized when $\beta = 1$. Substituting $\beta = 1$ into $I^*(\beta)$, we obtain the expression of $I^*$ as 

\begin{align}
I^*(\beta = 1) &= \sum_i H(M_i^t) - \sum_i H(M_i^t|M_i^{t-\tau}), \tag{9} \\
&= \sum_i I({}^{\max}M_i^{t-\tau}; M_i^t). \tag{10}
\end{align}

From Eq. 10 and the definition $\Phi^* = I - I^*$, we see that our measure is equivalent to the original measure when the maximum entropy distribution is used. 
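The result can be checked numerically on a small discrete system, with the integrals replaced by sums. The following is a minimal sketch and not the authors' code: the two-binary-unit system, the random transition matrix, and the function names `mismatched_information` and `mutual_information` are assumptions made for illustration. With a uniform (maximum entropy) past distribution, a grid search over $\beta$ should peak at $\beta = 1$, and $I^*(1)$ should match the sum of the mutual information in the parts.

```python
import numpy as np

def mismatched_information(joint, q, beta):
    """Discrete I*(beta). joint[x_past, x_next] = p(x_past, x_next);
    q[x_past, x_next] = mismatched decoder q(x_next | x_past)."""
    p_past = joint.sum(axis=1)
    p_next = joint.sum(axis=0)
    r = p_past @ q**beta  # r(x_next) = sum over past of p(past) * q(next|past)^beta
    return -np.sum(p_next * np.log(r)) + beta * np.sum(joint * np.log(q))

def mutual_information(joint):
    """Mutual information (nats) between the two axes of a joint table."""
    outer = np.outer(joint.sum(axis=1), joint.sum(axis=0))
    mask = joint > 0
    return np.sum(joint[mask] * np.log(joint[mask] / outer[mask]))

rng = np.random.default_rng(0)

# Hypothetical toy system: two binary units, whole-system state index = 2*m1 + m2.
transition = rng.dirichlet(np.ones(4), size=4)  # rows are p(x_next | x_past)
joint = transition / 4.0                        # uniform (maximum entropy) past

# View the joint as a tensor indexed [m1_past, m2_past, m1_next, m2_next].
joint4 = joint.reshape(2, 2, 2, 2)
joint1 = joint4.sum(axis=(1, 3))  # p(m1_past, m1_next)
joint2 = joint4.sum(axis=(0, 2))  # p(m2_past, m2_next)

# Per-unit conditionals and the factorized (mismatched) decoder.
q1 = joint1 / joint1.sum(axis=1, keepdims=True)
q2 = joint2 / joint2.sum(axis=1, keepdims=True)
q = np.einsum("ac,bd->abcd", q1, q2).reshape(4, 4)

# Grid search for the beta that maximizes I*(beta).
betas = np.linspace(0.5, 1.5, 1001)
values = np.array([mismatched_information(joint, q, b) for b in betas])
beta_star = betas[np.argmax(values)]

i_star = mismatched_information(joint, q, 1.0)
sum_parts = mutual_information(joint1) + mutual_information(joint2)
phi_star = mutual_information(joint) - i_star

print("beta maximizing I*:", beta_star)
print("I*(1):", i_star, " sum of part-wise mutual information:", sum_parts)
print("Phi*:", phi_star)
```

A grid search is used only to keep the sketch free of extra dependencies; any one-dimensional optimizer would serve, since $I^*(\beta)$ is concave in $\beta$.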


$\Phi^*$ is 0 when parts are perfectly correlated 

We show that $\Phi^*$ is 0 when parts are perfectly correlated. For simplicity, we consider a system consisting of two units $M_1$ and $M_2$ and the mismatched decoder $q(X^t|X^{t-\tau}) = p(M_1^t|M_1^{t-\tau})\, p(M_2^t|M_2^{t-\tau})$. It is easy to generalize to the case of more than two units. When the two units are perfectly correlated, $M_1^{t-\tau} = M_2^{t-\tau}$ and $M_1^t = M_2^t$. In this case, the joint probability distribution can be written as $p(X^{t-\tau}) = p(M_1^{t-\tau})\, p(M_2^{t-\tau}|M_1^{t-\tau}) = p(M_1^{t-\tau})\, \delta(M_2^{t-\tau} - M_1^{t-\tau})$, where $\delta(x)$ is the Dirac delta function. $I^*(\beta)$ can then be calculated as follows. 


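\begin{align}
I^*(\beta) &= -\int dX^t\, p(M_1^t)\, \delta(M_2^t - M_1^t) \log \int dX^{t-\tau}\, p(M_1^{t-\tau})\, \delta(M_2^{t-\tau} - M_1^{t-\tau}) \prod_{i=1}^{2} p(M_i^t|M_i^{t-\tau})^{\beta} \notag \\
&\quad + \beta \int dX^{t-\tau} \int dX^t\, p(X^{t-\tau}, X^t) \log \prod_{i=1}^{2} p(M_i^t|M_i^{t-\tau}), \tag{11} \\
&= -\int dM_1^t\, p(M_1^t) \log \int dM_1^{t-\tau}\, p(M_1^{t-\tau})\, p(M_1^t|M_1^{t-\tau})^{2\beta} - 2\beta H(M_1^t|M_1^{t-\tau}). \tag{12}
\end{align}

To obtain the $\beta$ that maximizes $I^*(\beta)$, we differentiate $I^*(\beta)$ as 

\begin{equation}
\frac{dI^*(\beta)}{d\beta} = -\int dM_1^t\, \frac{p(M_1^t)}{r(M_1^t)} \frac{dr(M_1^t)}{d\beta} - 2H(M_1^t|M_1^{t-\tau}), \tag{13}
\end{equation}

where 

\begin{align}
r(M_1^t) &= \int dM_1^{t-\tau}\, p(M_1^{t-\tau})\, p(M_1^t|M_1^{t-\tau})^{2\beta}, \tag{14} \\
\frac{dr(M_1^t)}{d\beta} &= 2 \int dM_1^{t-\tau}\, p(M_1^{t-\tau})\, p(M_1^t|M_1^{t-\tau})^{2\beta} \log p(M_1^t|M_1^{t-\tau}). \tag{15}
\end{align}

Substituting $\beta = 1/2$ into $dI^*(\beta)/d\beta$, we obtain 

\begin{align}
\frac{dI^*}{d\beta}(\beta = 1/2) &= -2 \int dM_1^t \int dM_1^{t-\tau}\, p(M_1^{t-\tau})\, p(M_1^t|M_1^{t-\tau}) \log p(M_1^t|M_1^{t-\tau}) - 2H(M_1^t|M_1^{t-\tau}), \tag{16} \\
&= 0. \tag{17}
\end{align}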

We therefore find that $I^*(\beta)$ is maximized when $\beta = 1/2$. Substituting $\beta = 1/2$ into $I^*(\beta)$, we obtain the expression of $I^*$ as 

\begin{align}
I^*(\beta = 1/2) &= H(M_1^t) - H(M_1^t|M_1^{t-\tau}), \tag{18} \\
&= I(M_1^{t-\tau}; M_1^t). \tag{19}
\end{align}
Supplementary Figure 1. Behaviors of $\Phi^*$ (A), $\Phi_I$ (B), $\Phi_H$ (C), mutual information $I$ (D), and correlation (E) when the strength of connections $a$ and the strength of noise correlation $c$ are both varied in a linear regression model (Eq. 12 in the main text). 


When $M_1$ and $M_2$ are perfectly correlated, the mutual information in the whole system is equal to the mutual information in each part, i.e., $I(X^{t-\tau}; X^t) = I(M_1^{t-\tau}; M_1^t) = I(M_2^{t-\tau}; M_2^t)$. Thus, $\Phi^*$ becomes 0: 

\begin{align}
\Phi^* &= I(X^{t-\tau}; X^t) - I^*(X^{t-\tau}; X^t), \tag{20} \\
&= I(M_1^{t-\tau}; M_1^t) - I(M_1^{t-\tau}; M_1^t), \tag{21} \\
&= 0. \tag{22}
\end{align}
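A discrete analogue of this argument can be checked numerically. The sketch below is illustrative only: the single-unit transition matrix `unit_transition`, the past distribution `unit_past`, and the function names are hypothetical choices, and sums replace the integrals and delta functions. Unit 2 is made an exact copy of unit 1, so the maximum of $I^*(\beta)$ should sit at $\beta = 1/2$ and $\Phi^*$ should vanish.

```python
import numpy as np

def mismatched_information(joint, q, beta):
    """Discrete I*(beta). joint[x_past, x_next] = p(x_past, x_next);
    q[x_past, x_next] = mismatched decoder q(x_next | x_past)."""
    p_past = joint.sum(axis=1)
    p_next = joint.sum(axis=0)
    r = p_past @ q**beta  # r(x_next) = sum over past of p(past) * q(next|past)^beta
    return -np.sum(p_next * np.log(r)) + beta * np.sum(joint * np.log(q))

def mutual_information(joint):
    """Mutual information (nats) between the two axes of a joint table."""
    outer = np.outer(joint.sum(axis=1), joint.sum(axis=0))
    mask = joint > 0
    return np.sum(joint[mask] * np.log(joint[mask] / outer[mask]))

# Hypothetical dynamics of a single binary unit.
unit_transition = np.array([[0.9, 0.1],
                            [0.3, 0.7]])   # p(m1_next | m1_past)
unit_past = np.array([0.6, 0.4])           # p(m1_past)
joint1 = unit_past[:, None] * unit_transition  # p(m1_past, m1_next)

# Whole system: unit 2 copies unit 1, so only states 00 and 11 have probability.
joint4 = np.zeros((2, 2, 2, 2))  # [m1_past, m2_past, m1_next, m2_next]
for a in range(2):
    for c in range(2):
        joint4[a, a, c, c] = joint1[a, c]
joint = joint4.reshape(4, 4)

# Both units share the same conditional, so the decoder is q1 * q1.
q1 = joint1 / joint1.sum(axis=1, keepdims=True)
q = np.einsum("ac,bd->abcd", q1, q1).reshape(4, 4)

# Grid search for the beta that maximizes I*(beta).
betas = np.linspace(0.0, 1.0, 1001)
values = np.array([mismatched_information(joint, q, b) for b in betas])
beta_star = betas[np.argmax(values)]

i_star = mismatched_information(joint, q, 0.5)
phi_star = mutual_information(joint) - i_star

print("beta maximizing I*:", beta_star)
print("I*(1/2):", i_star, " I(M1_past; M1_next):", mutual_information(joint1))
print("Phi*:", phi_star)
```

The factor of 2 in the exponent of Eq. 12 appears here because the product of the two identical conditionals is the square of one of them, which is why the optimum moves from $\beta = 1$ to $\beta = 1/2$.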

Theoretical requirements for a measure of integrated information are not satisfied by previously proposed measures 

As we detailed in the main article, integrated information should be bounded from below by 0 and from above by the mutual information in the whole system, $I$. In the main article, we used a simple linear regression model to demonstrate that $\Phi_I$ and $\Phi_H$ violate these bounds. Fig. S1 shows the behaviors of $\Phi^*$ (A), $\Phi_I$ (B), $\Phi_H$ (C), mutual information $I$ (D), and the correlation coefficient between units (E) when the strength of connections $a$ and the strength of noise correlation $c$ are both varied in the same linear regression model as in the main article (Eq. 12). As Fig. S1 shows, $\Phi_I$ goes negative when the degree of correlation is high and thus does not satisfy the lower bound. $\Phi_H$ is not 0 even when there is no information ($a = 0$), where the upper bound $I$ is itself 0, and thus does not satisfy the upper bound. By comparing panels (C) and (D), which use the same color scale, we can see that $\Phi_H$ violates the upper bound when the noise correlation $c$ is high. $\Phi^*$ always satisfies both the lower bound and the upper bound and can therefore be considered a proper measure of integrated information. 
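The two violations can be reproduced in closed form for a stationary Gaussian model. The sketch below rests on several assumptions that are not stated in this supplement: that the model takes the form $X^t = A X^{t-1} + E^t$ with every entry of $A$ equal to $a$ and noise covariance $[[1, c], [c, 1]]$ (the exact form of Eq. 12 is given in the main text), that $\Phi_I = I(X^{t-1}; X^t) - \sum_i I(M_i^{t-1}; M_i^t)$, and that $\Phi_H = \sum_i H(M_i^{t-1}|M_i^t) - H(X^{t-1}|X^t)$. The function names are hypothetical.

```python
import numpy as np

def stationary_covariance(A, sigma_e):
    """Solve Sigma = A Sigma A^T + Sigma_E by vectorization.
    Requires the spectral radius of A to be below 1."""
    n = A.shape[0]
    vec = np.linalg.solve(np.eye(n * n) - np.kron(A, A), sigma_e.reshape(-1))
    return vec.reshape(n, n)

def gaussian_measures(a, c):
    """Return (I, Phi_I, Phi_H) in nats for the assumed two-unit model."""
    A = a * np.ones((2, 2))                    # assumed connectivity matrix
    sigma_e = np.array([[1.0, c], [c, 1.0]])   # assumed noise covariance
    sigma = stationary_covariance(A, sigma_e)  # Cov(X^t) = Cov(X^{t-1})
    cross = A @ sigma                          # Cov(X^t, X^{t-1})

    # Whole system: the conditional covariance of X^t given X^{t-1} is Sigma_E.
    mi_whole = 0.5 * np.log(np.linalg.det(sigma) / np.linalg.det(sigma_e))

    # Each unit alone: mutual information from its lag-1 autocorrelation.
    rho = np.diag(cross) / np.diag(sigma)
    mi_parts = -0.5 * np.log(1.0 - rho**2)
    phi_i = mi_whole - mi_parts.sum()

    # Phi_H = Phi_I + sum_i H(M_i) - H(X); the (2 pi e) constants cancel.
    phi_h = phi_i + 0.5 * np.log(np.prod(np.diag(sigma)) / np.linalg.det(sigma))
    return mi_whole, phi_i, phi_h

for a, c in [(0.4, 0.0), (0.4, 0.9), (0.0, 0.5)]:
    mi_whole, phi_i, phi_h = gaussian_measures(a, c)
    print(f"a={a}, c={c}: I={mi_whole:.4f}, Phi_I={phi_i:.4f}, Phi_H={phi_h:.4f}")
```

Under these assumptions, $\Phi_I$ is negative at $(a, c) = (0.4, 0.9)$, and $\Phi_H$ stays positive at $a = 0$ even though $I = 0$ there. The sketch does not compute $\Phi^*$, whose Gaussian expression is derived in the main text.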